INTRODUCTION

This report compiles the results of the research performed under U.S. Army
Research Office Contract Number DAAG29-82-K-0101, covering the period April 1,
1982 through September 31, 1985. The work is in the area of modeling asynchronous
parallel architectures and computation for applications in the areas of digital image and
signal processing. The work can be broadly divided into three areas:

1.  Case  studies  of  parallel  image  processing  algorithms  and  tasks,  the  objective  of 
which is to study the interaction of parallel processes and parallel architectures.
These  are  described  in  Papers  1  through  5  and  in  portions  of  Appendices  A,  B,  C, 
and  D. 

2. Modeling of interconnection networks. An important component of any large-scale
distributed system is the interconnection network. Different techniques for modeling
interconnection networks were developed and are described in Papers 6
through 9 and in portions of Appendices A, C, and E.

3. Aspects of the problem of modeling parallel processes and parallel architectures.
This includes mechanisms for describing MIMD algorithms (Paper 11 and portions
of Appendices A and D), application of a Petri net based modeling scheme to
SIMD and pipeline implementations of example image processing algorithms (portions
of Appendices A and F), consideration of performance criteria for parallel
image processing algorithms (portions of Appendix F), matching algorithms with
macropipelined distributed systems (Paper 12 and portions of Appendix G), new
models for the organization of distributed systems comprised of collections of special
purpose computing devices (Paper 10), and companion features for describing
parallel processes and parallel architectures (portions of Appendices A and D).

PARALLEL ALGORITHMS FOR COMPUTER VISION

Abstract

An application of parallel processing to the computationally
intensive task of computer vision is presented.
Computational speedups, both theoretical and experimental,
are derived and presented for the extraction of
several parameters based upon the SRI vision module and
Fourier descriptors. Good results are obtained for
moderate numbers of processing elements. The use of
parallel processing allows easier expansion and
modification of the vision algorithms as compared with a
hardware approach.

1.  Introduction 

Parallel  processing  offers  the  potential  of  providing 
fast,  flexible  solutions  to  many  computationally  intensive 
tasks.  In  this  paper,  the  use  of  parallelism  for  computer 
vision  is  described.  Theoretical  analyses  and  simulation 
results  are  presented.  Considerations  for  the  design  of  a 
parallel  architecture  for  computer  vision  are  discussed. 

2. Definitions for the Parallel Simulation

In this section, two general models of parallel computation
are defined and the specific model used for the
computer vision task is presented. The implementation
of the parallel simulation is described.

Model 

Single instruction stream - multiple data stream
(SIMD) machines [4] represent a form of synchronous,
highly parallel processing. Systems with up to 1,000 full
processors have been proposed [10, 11]; systems with as
many as 0,000 and 16,000 simple processors have been
built [2, 3]. An SIMD machine typically consists of a
control unit, a set of P processing elements (PEs), each a
processor with its own memory, and an interconnection
network. The control unit broadcasts instructions to all
PEs and each active PE executes the instruction on the
data in its own memory. The interconnection network
allows data to be transferred among the PEs. SIMD
machines are especially well-suited for exploiting the
parallelism inherent in certain tasks performed on vectors
and arrays.

Multiple instruction stream - multiple data stream
(MIMD) machines [4] represent asynchronous parallel processing.
MIMD systems with 16 [18] and 50 [16] processors
have been built; MIMD systems with as many as
4,000 processors [6] have been proposed. An MIMD
machine typically consists of P processors and M
memories, M > P, where each processor can follow an
independent instruction stream. As with SIMD machines,
there is a multiple data stream and an interconnection
network. Thus, there are P independent processors that
can communicate among themselves. There may be a
coordinator unit to oversee the activities of the processors.

The parallel machine model assumed for the computer
vision task consists of a set of PEs under the
management of a control unit. The number of PEs is a
power of two: $N = 2^n$. Each of the PEs has a unique
"address" between 0 and $N - 1$. In addition, there exists
an interconnection network to allow the simultaneous
transfer of data among the PEs. For the computer vision
task, the transfer patterns required will be uniform
modulo shifts and "cube" interconnection functions. In
a uniform modulo shift, PE $j$ transfers data to PE $(j + d)$
modulo $N$ for all $j$, $0 \le j < N$, given a positive or negative
integer distance $d$. The value of $d$ may vary from
one transfer to the next; however, for a given transfer, all
PEs will send their data the same distance $d$. The set of
cube interconnection functions consists of $n = \log_2 N$ functions,
$\mathrm{cube}_i$, for $0 \le i < n$ [13]. If $p_{n-1} \cdots p_i \cdots p_0$ is
the binary representation of a PE's address, then the
$\mathrm{cube}_i$ function exchanges data between all pairs of PEs
whose addresses differ in bit $i$:

$$\mathrm{cube}_i(p_{n-1} \cdots p_i \cdots p_0) = p_{n-1} \cdots \bar{p}_i \cdots p_0$$

where $\bar{p}_i$ denotes the complement of bit $p_i$.
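
The two transfer patterns can be illustrated with a short serial sketch. The Python below is not part of the original simulation (which was written in a modified C); the function names `uniform_shift` and `cube` and the one-list-entry-per-PE representation are assumptions made purely for illustration.

```python
def uniform_shift(data, d):
    """Uniform modulo shift: PE j sends its value to PE (j + d) mod N."""
    n_pes = len(data)
    result = [None] * n_pes
    for j, value in enumerate(data):
        result[(j + d) % n_pes] = value  # d may be positive or negative
    return result


def cube(data, i):
    """cube_i: each PE exchanges data with the PE whose address differs in bit i."""
    return [data[j ^ (1 << i)] for j in range(len(data))]


# One value per PE, N = 8 (n = 3), so cube_0, cube_1, and cube_2 exist.
values = ["a", "b", "c", "d", "e", "f", "g", "h"]
shifted = uniform_shift(values, 1)  # every PE sends distance +1
swapped = cube(values, 2)           # PE 0 <-> PE 4, PE 1 <-> PE 5, ...
```

Applying the same cube function twice returns every value to its original PE, since complementing bit $i$ twice restores the address.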

The model assumed here combines SIMD and MIMD
attributes. Each PE will contain the same code but will
execute the code on a different subimage. However,
within each PE, the code can run in MIMD mode. This
modification to the basic models allows faster execution
on some code than a pure SIMD model without incurring
the expense of the full flexibility of an MIMD machine.
The gains in speed will occur on the execution of conditional
statements:

where <condition> do <block1>
elsewhere do <block2>

In SIMD mode, those PEs satisfying the <condition>
execute <block1>. Then the remaining PEs execute
<block2>. In the model here, <block1> and
<block2> will be executed concurrently, but in different
sets of PEs. On the other hand, this is not full MIMD
mode, as it is required that the code in each PE be the
same. This aids in ensuring synchronization and thus
helps enforce data coherence, e.g., ensuring that a PE
acquires the correct version of a variable from another
PE.
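
The timing difference on a conditional statement can be sketched as follows. This is only an illustrative cost model, not the authors' analysis: the function names, the per-block times `t1` and `t2`, and the assumption that a pure SIMD machine skips a block that no PE needs are all hypothetical.

```python
def simd_time(conditions, t1, t2):
    """Pure SIMD: <block1> and <block2> are broadcast one after the other."""
    first = t1 if any(conditions) else 0       # assumed skipped if no PE is active
    second = t2 if not all(conditions) else 0  # assumed skipped if no PE is active
    return first + second


def hybrid_time(conditions, t1, t2):
    """Assumed model: each PE runs its own branch, so the branches overlap."""
    return max(t1 if satisfied else t2 for satisfied in conditions)


# One flag per PE: True means the PE satisfies <condition>.
flags = [True, False, True, False]
serialized = simd_time(flags, t1=5, t2=3)    # branches run back to back
overlapped = hybrid_time(flags, t1=5, t2=3)  # branches run concurrently
```

Under these assumptions the gain appears only when the PEs disagree on the condition; when every PE takes the same branch the two models cost the same.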

Synchronization can take place in one of two ways.
First, synchronization is required at all data transfer
points. This is done because data transfers often involve
the same variable for all of the PEs. Thus, it does not
matter if the separate processors take different times to
execute their code, as they will be forced to synchronize
at transfers to ensure coherence. Explicit synchronization
will also be possible by one of the simulation language
constructs that requires that all PEs finish a section of
code before any can move to the next section of code.

The motivation for the assumed model comes about
from two directions. First, for many image processing
operations, it is natural to consider executing the same
code on subimages of the original image. Each subimage
is a valid image and the same types of operations are
needed on the pixels of each subimage. Second, since the
actual quantities of the various operations that will be
performed on each subimage may vary, asynchronous
operation may allow higher PE utilization.

Simulation 

There are two major approaches to the development
of parallel software. Either the software can be of a generally
descriptive nature to illustrate the parallelism (or
lack thereof) inherent in a task, or the software can be
designed to be compilable and testable, either by parallel
execution or via serial simulation. Due to the computational
intensity and intricacy of the computer vision task,
the most reliable way to ensure correctness is via testing.
Testing the software on a variety of images verifies that
typical problem cases are handled correctly. A set of test
images, some with multiple objects,
was used for debugging and for analyzing computational
speedup. Therefore, the software was designed so that it
could be compiled and tested.

Programming was done in a modified version of the
"C" language. This language was chosen for the capabilities
it provides for developing parallel data structures
and the high degree to which one can manipulate
system-level resources (such as memory areas). The latter played a
large part in the simulating of parallel data transfers.

The extension of the serial C language to a
parallel language was done via macros and support subroutines.
These features were designed to facilitate the
writing of parallel code without requiring the user
to know the specific details of the serial implementation;
a user can simply use the macro file without knowing
these details and can then write parallel code.

The major portions of this implementation are as follows.
A construct of the form

in pe { code block }

executes the enclosed block of code in each of the PEs.

A prefix prepended to a variable indicates that
the variable is local to a PE. All other variables are
assumed to be global (i.e., the control unit has one copy
of the variable). Global variables are used for such
operations as loop control and overall conditional testing.

There are also versions of the "in pe" construct that
allow the code to be executed in a limited subset of the
PEs. These schemes use an address mask [12], which is a
pattern that the PE address must match for
execution to occur in that PE.

Interprocessor communication is accomplished via a
"transfer" subroutine:

transfer(destination_address, source_address, offset).

The transfer routine uses these addresses along with
information about the size and structure of the PE data
space to simulate the transfer via a memory-to-memory
move. Recursive transfers and broadcasts (where one
value is transferred to all of the PEs) are similar.
Synchronization is needed at transfer points to ensure
data coherence.

The  vision  software  and  simulations  were  run  on  a 
dual-processor  Vax  11/780  [5]. 

3. Overview of the Vision Algorithms

In this section, an overview of the computer vision
algorithms is provided. The parameters described are
based on the SRI vision module [9] and Fourier descriptors [17].

A simple mechanism for entering an image into the
system was desired. In the method chosen, the user
employs a terminal with cursor control to draw an image
on the screen and enter that image into the data
memory. This section of the code used a small subsection
of the "curses" [1] utilities available on the test system.

After an image has been entered into the data
memory, the first task is to classify the image. This consists
of transforming an image comprised of edge and
non-edge pixels into an image with edge, internal, and
external pixels. An internal pixel is a pixel that
represents a point on an object, whereas an external pixel
represents a point external to an object (such as the
external background or a hole in the object).

After the inside and the outside of the image have
been identified by the classification step, the holes in the
image are located. A hole is defined as an area outside the
object. Thus, the background also fits the definition of a
hole. These holes are identified so that later merging can
be accomplished easily. This capability is needed since
holes that are initially thought to be separate may actually
be joined.

The areas of the holes are computed and recorded at
the same time as the original hole identification, since the
data search patterns are quite similar. For purposes of
isolating the object parameters, the background is defined
to have an area of zero.

Once the inside of the object is known, the center of
mass of the object is determined. Although in and of
itself the center of mass is not a particularly useful
parameter, it is used to normalize some of the perimeter
statistics to be derived later.

To find the perimeter, the edge points that are adjacent
to the background are identified. Once this has been
done, it is a simple matter to find the distances from the
perimeter points to the center of mass. These distances
are used to calculate the average, minimum, and maximum
perimeter distance from the center of mass.

Finally, using the already determined perimeter, a
description of this perimeter is produced in the form of a
list of coordinate pairs. This list can then be used to
determine Fourier descriptors or other similar parameters.
Provisions have been made for the processing of images
that contain multiple (non-overlapping) objects.

4. Detailed Description of the Parallel Software

In this section, details of the vision algorithms and of
their parallel implementation are presented. Results of
the simulation of the parallel algorithms and analysis of
the performance of the parallel vision system are
presented in Section 5.

Image Initialization

To be able to test the system easily, a simple
method by which a user could enter an image into the
system was developed. The user executes the vision program
and then uses a standard keyboard to direct the
cursor and draw an image border. The user also has the
option of turning the cursor on and off to allow him/her
to draw unconnected borders (such as an internal border).
The connection pattern for the drawing is an eight-neighbor
scheme. That is, from a given point, the user can
direct the cursor in any of the four horizontal and vertical
directions as well as along the diagonals between these
directions. After the user has created the image to
his/her satisfaction, an exit command automatically
starts the image processing on the given image.

The produced image can be saved for later testing
and can be reloaded and modified. The user also has the
option of either saving the results in a text file or of simply
viewing the results as they are produced.

For the parallel implementation, once the image has
been created, it is divided among the PEs with each of
the PEs having an equally dimensioned stripe (either horizontal
or vertical) of the image. Subsequently, each PE
operates on the section of the image contained in its local
memory, communicating with other PEs when further
information is needed.
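
The stripe decomposition can be sketched as below. This is a minimal illustration, assuming the image is a list of rows and that its dimension divides evenly by the number of PEs; the function name `split_into_stripes` is hypothetical and does not come from the original software.

```python
def split_into_stripes(image, n_pes, vertical=True):
    """Divide an image (a list of rows) into n_pes equally dimensioned stripes."""
    rows, cols = len(image), len(image[0])
    size = cols if vertical else rows
    if size % n_pes != 0:
        raise ValueError("image dimension must divide evenly among the PEs")
    if vertical:
        # Vertical stripes split the x axis: PE k receives a block of columns.
        width = cols // n_pes
        return [[row[k * width:(k + 1) * width] for row in image]
                for k in range(n_pes)]
    # Horizontal stripes split the y axis: PE k receives a block of rows.
    height = rows // n_pes
    return [image[k * height:(k + 1) * height] for k in range(n_pes)]


image = [[0, 1, 2, 3, 4, 5, 6, 7],
         [8, 9, 10, 11, 12, 13, 14, 15]]
stripes = split_into_stripes(image, 4)  # four vertical stripes, two columns each
```

With vertical stripes, only the leftmost and rightmost columns of each stripe border another PE, which is why the later steps need to exchange only border elements.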

Internal / External Classification

As a result of the internal/external classification,
each pixel is labeled as being on the inside of the object,
outside the object, or on the border. The classification
scheme implemented is a two-pass method. The first pass
traverses the image from the upper left to the lower
right. The initial classification of a pixel is based upon
the two neighboring points (to the left of the current
point and above the current point) that have already
been classified. The method tries to classify the new point
as external if either of the previous points is external. If
the adjacent points are both edges (border pixels), then
information about the length of the edge and the previous
region classifications is used to make the
classification.

The second pass traverses the image from the lower
right to the upper left (backwards as compared to the
forward pass). This pass uses the four major compass
points in relation to the current point to attempt to
correct any classification errors. Again, the bias is toward
external classification.

This section of the vision software uses several
schemes to ensure robustness. Besides the ability to
reclassify points on the second pass, the software also
looks for the specific case of tracing an edge. In addition,
several trouble patterns are checked to prevent major
misclassifications. Figure 1 illustrates the classification
procedure. Figure 1a is the image before classification
(border only). The edges are represented by '2.' Figures
1b and 1c are the image after the first and second passes
of the classification, respectively. Internal points are
represented by '1' and external points are represented by
'0.' An example of a reclassification on the second pass is
illustrated by the outlined areas in Figures 1b and 1c.

In the parallel implementation, each PE works with
its own stripe of the image data. The communication
between PEs is limited to the values of the border elements
of a subimage. One such transfer will take place
for each border element on one of the sides of the subimage.
These transfers will be uniform modulo shifts of distance
one. As the results show later, this section of the
software demonstrates good speedup. Thus, the assumption
of a two-pass classifier gives a conservative speedup
estimation: if more passes were used, each pass would
exhibit the same good speedup.

Fig. 1a. Initial image.
Fig. 1b. Classification: Pass 1.
Fig. 1c. Classification: Pass 2.

Identifying Image Holes

After the object has been separated from its surroundings
by the classification operation, the holes in the
image are identified. This process consists of two steps:
initial local hole identification within each PE, followed
by merging of holes between PEs. Hole labeling is
initially performed separately within each PE. This is
done by creating a template array in each PE that is of
the same size as the subimage in the PE. Each template
location contains an identifier that indicates the local hole
number for the corresponding subimage point or zero for
non-hole points. Each time an external point is located
that is not adjacent to a previous hole, a new hole
identifier is used and entered for that point in the template.
If the external point is adjacent to a previous hole,
then the previous identifier is continued. A two-neighbor
scheme is used for all of the pixels except those on one of
the subimage borders. Since the points on one edge will
have only points from the previous row (or column, in the
case of horizontal stripes) to base a decision upon, a one-neighbor
scheme is used at the borders. The software
maintains a set of parameters that keep track of merged
holes and their statistics in order to handle the special
case of an external point adjacent to two different previous
hole identifiers. Experimentation showed that no
accuracy problems were encountered due to the small
number of neighbors used in the classification.
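
The local labeling step within one PE can be sketched as follows. This is a simplified serial illustration, not the original code: it assumes the pixel code 0 marks external points (as in Figure 1), uses the upper and left neighbors as the two neighbors, and keeps the merge bookkeeping in a hypothetical `master` list that plays the role of a pointer to the "master" identifier. Hole areas are tabulated in the same scan.

```python
EXTERNAL = 0  # assumed pixel code for external points ('0' in Figure 1)


def label_holes(subimage):
    """Label the external regions of one PE's subimage in a single scan."""
    rows, cols = len(subimage), len(subimage[0])
    template = [[0] * cols for _ in range(rows)]  # 0 marks a non-hole point
    master = [0]  # master[k] == k means identifier k holds the hole's statistics
    area = [0]    # area[k] is meaningful only for master identifiers

    def resolve(k):
        # Follow the pointers until the master identifier is reached.
        while master[k] != k:
            k = master[k]
        return k

    for r in range(rows):
        for c in range(cols):
            if subimage[r][c] != EXTERNAL:
                continue
            up = resolve(template[r - 1][c]) if r > 0 else 0
            left = resolve(template[r][c - 1]) if c > 0 else 0
            if up == 0 and left == 0:
                master.append(len(master))  # start a new hole identifier
                area.append(0)
                label = len(master) - 1
            else:
                label = up or left          # continue the previous identifier
                if up and left and up != left:
                    master[left] = up       # two identifiers meet: merge them
                    area[up] += area[left]
                    area[left] = 0
            template[r][c] = label
            area[label] += 1
    areas = {k: area[k] for k in range(1, len(master)) if master[k] == k}
    return template, areas, resolve


# A U-shaped external region around a short edge segment ('2' marks edges).
subimage = [[0, 2, 0],
            [0, 2, 0],
            [0, 0, 0]]
template, areas, resolve = label_holes(subimage)
```

In this example the two arms of the U first receive different identifiers and are joined only when the scan reaches the bottom-right pixel, which is the special case the merge bookkeeping exists to handle.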

These operations are performed totally within a PE:
no communication with other PEs is needed. Each PE
owns the information about its own holes. This information
is transferred to other PEs during hole merging
(described later). Figure 2 shows the internal hole
identifiers for each PE. Hole identifiers that are adjacent
(e.g., labels 3, 4, 7, and 8 in PE 2) are considered common.
That is, only one of the identifiers contains the
information for the hole. All of the others contain a
pointer to the "master" information.

Fig. 2. Image hole determination.

Once the holes have been identified within each PE,
they are merged across the PE borders. This is done by
transferring the borders of the PE hole template to adjacent
processors and searching for matching holes. The
areas are merged at the same time that holes are joined.
In the scheme used, if a hole has only one edge on a PE
border, then the statistics for that hole are transferred to
that adjacent PE. This results in each hole being "controlled"
by one PE. The information that needs to be
transferred from each PE is placed on a transfer stack.
These stacks are then transferred. All of these are
transfers to logically neighboring PEs (uniform modulo
shifts of a distance of one). The amount of information
transferred is highly dependent upon the actual image.
For purposes of easy identification and to separate holes
within an object from the background, the border background
is defined as having an area of zero. The process
of merging is illustrated in Figure 3.

Fig. 3. Hole merging example.

This method of merging holes across PEs is deterministic
in that the maximum number of passes needed
can be determined by the types of images being examined.
For example, the more an object tends to spiral (a
spring, for example, as compared to a wheel), the more
passes that will be needed. In order to analyze performance,
a fixed number of passes (more than necessary for
the images considered) was assumed. In simulation, it was
found that this section provides poor speedup. Thus, the
net result of the fixed large number of passes is again to
provide a conservative estimate of the computational
speedup of the algorithm.


Computing Image Hole Areas

The areas to be computed are tabulated at the same
time as the hole identifiers are placed in the template in
each PE. The area computation is therefore divided
among the PEs. To handle the merging of holes, either
within a PE or between PEs, an indirection table that
points to the actual hole area is used.

Locating  the  Center  of  Mass 

After the points that comprise an object are known,
the center of mass of the object can be easily determined.
In this system this step is performed by computing the
moments in each PE separately and then summing across
PEs using recursive doubling [15] (Figure 4). The
transfers used are the $\mathrm{cube}_i$ functions, $0 \le i < \log_2 N$.
This scheme requires that each PE know its absolute
position in the configuration since the weighting of one of
the moments in each PE is dependent upon the PE
address. For example, if the stripes are in the vertical
direction, then the x axis will be split among the PEs.
Moments that involve the absolute distance along the x
axis will depend upon the PE address. To obtain the
center of mass, $\log_2 N$ sets of transfers will be needed.
After the center of mass has been determined, it is broadcast
to all PEs since this information will be needed at a
local PE level in later processing.
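
The summation across PEs can be sketched as follows, assuming vertical stripes of equal width. This is an illustrative serial sketch rather than the original code: the function names are hypothetical, the choice to count both internal ('1') and edge ('2') pixels as object points is an assumption, and the exchange variant shown leaves the total in every PE, whereas the scheme in Figure 4 may accumulate into one PE and then broadcast.

```python
def recursive_doubling_sum(local_values):
    """Sum one value per PE in log2(N) stages of cube_i exchanges."""
    n_pes = len(local_values)
    stages = n_pes.bit_length() - 1
    if n_pes < 1 or (1 << stages) != n_pes:
        raise ValueError("the number of PEs must be a power of two")
    totals = list(local_values)
    for i in range(stages):
        # cube_i transfer: PE j receives the partial sum held by PE j XOR 2^i.
        received = [totals[j ^ (1 << i)] for j in range(n_pes)]
        totals = [a + b for a, b in zip(totals, received)]
    return totals  # every PE now holds the grand total


def local_moments(stripe, pe_address, object_codes=(1, 2)):
    """Moments of one vertical stripe; x is offset by the PE's position."""
    width = len(stripe[0])
    m00 = m10 = m01 = 0
    for y, row in enumerate(stripe):
        for x, pixel in enumerate(row):
            if pixel in object_codes:
                m00 += 1
                m10 += pe_address * width + x  # absolute x depends on PE address
                m01 += y
    return m00, m10, m01


def center_of_mass(stripes):
    moments = [local_moments(s, k) for k, s in enumerate(stripes)]
    m00, m10, m01 = (recursive_doubling_sum(list(m))[0] for m in zip(*moments))
    return m10 / m00, m01 / m00


# A 2 x 4 image of object pixels split into four one-column stripes.
stripes = [[[1], [1]] for _ in range(4)]
center = center_of_mass(stripes)
```

The x moment is the only quantity that needs the PE address, matching the observation that the weighting of one of the moments depends on the PE's position.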
through an indirection table, all one needs to do is see if
the hole has zero area. When a perimeter point is located
in a PE, a counter in that PE is also incremented so that
the total perimeter can be determined by a simple application
of recursive doubling to accumulate the total
across the PEs.

After the perimeter has been identified, it is a simple
matter to find the distances between the perimeter points
and the previously determined center of mass. This is
done by scanning through the image template looking for
perimeter points. Each PE scans its stripe of the image.
For each perimeter point so found, the radial distance
from the perimeter point to the center of mass is determined.
A running sum is kept of these distances, along
with the minimum and the maximum distances. When
the entire image has been scanned, recursive doubling is
used to find the average, minimum, and maximum such
distances. Three stages of recursive doubling transfers
will be needed, one set for each of the perimeter statistics
being gathered. This results in a total of $3 \log_2 N$
transfers.
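
The three reductions can be sketched with a single recursive doubling routine that takes the combining operation as a parameter. This is an illustrative sketch under stated assumptions, not the original code: the function names are hypothetical, the per-PE distance lists stand in for the result of each PE scanning its stripe, and the total perimeter count is assumed to be already known from the earlier counter accumulation.

```python
import math


def recursive_doubling(values, combine):
    """Combine one value per PE over log2(N) cube_i stages; return (result, stages)."""
    n_pes = len(values)
    stages = n_pes.bit_length() - 1
    if n_pes < 1 or (1 << stages) != n_pes:
        raise ValueError("the number of PEs must be a power of two")
    current = list(values)
    for i in range(stages):
        received = [current[j ^ (1 << i)] for j in range(n_pes)]
        current = [combine(a, b) for a, b in zip(current, received)]
    return current[0], stages


def perimeter_statistics(distances_per_pe, perimeter_count):
    """Average, minimum, and maximum radial distance, plus transfer stages used."""
    total, s1 = recursive_doubling(
        [sum(d) for d in distances_per_pe], lambda a, b: a + b)
    # A PE with no perimeter points contributes neutral values to min and max.
    nearest, s2 = recursive_doubling(
        [min(d, default=math.inf) for d in distances_per_pe], min)
    farthest, s3 = recursive_doubling(
        [max(d, default=-math.inf) for d in distances_per_pe], max)
    return total / perimeter_count, nearest, farthest, s1 + s2 + s3


# Radial distances found by each of four PEs (the second PE found none).
distances = [[1.0, 3.0], [], [2.0], [6.0]]
average, nearest, farthest, stages = perimeter_statistics(distances, perimeter_count=4)
```

For $N = 4$ the sketch uses $3 \log_2 4 = 6$ transfer stages, one set of $\log_2 N$ for each statistic.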

Figure 5 shows the identified perimeter for an image.
The perimeter is noted by 'B,' as compared to '2' for a
non-perimeter edge point. Figure 6 shows an example of
the overall output of the vision software.

Data Preparation for Fourier Descriptors

As an illustration of some of the higher level functions
that can be performed once the basic parameters
have been extracted, the image can be converted into the
information necessary to calculate Fourier descriptors
[17]. This information is simply an ordered list representation
of the perimeter of the object. Each entry in this
list consists of a set of coordinates representing a perimeter
point. Fourier descriptors have been proposed as a
method of performing shape analysis.

Fig. 4. Example of summing across PEs using recursive
doubling.

Perimeter Identification and Perimeter Statistics
Determination

Identifying the perimeter is straightforward once the
external background hole has been identified. This hole
has area zero by definition. An edge point next to an
external hole (or next to another perimeter point) is a
perimeter point. Since the area of holes is determined

Fig. 5. Object perimeter determination and center of
mass statistics.

Fig. 6. Example of vision software output.

The vision software begins this step by forming the
perimeter nodes into a multiply-linked list. This is done
to facilitate the removal of false perimeter points (spikes).
This converts the perimeter into a traceable contour.
Next, these linked lists are transferred to one PE which
completes the processing. This requires uniform modulo
shifts of distances from 1 to N - 1. This processing
includes converting the lists into partial ordered lists and
then combining these lists. Other schemes, such as forming
the partial lists in each PE separately, were found to
entail such a large amount of overhead in transfers that
any advantages in parallelism were lost. The final contours
in the single PE are then broadcast to the
remainder of the PEs in preparation for the Fourier
descriptor calculations. If the perimeter is equally distributed
among the PEs, (N - 1)/N of the partial ordered
listings will need to be transferred. Each of the objects in
one of these lists contains ten data fields (two link fields
for the linked list and eight neighbor pointers). If the
perimeter is not equally distributed, then the perimeter
can be gathered into the PE with the largest number of
perimeter points, and this will require fewer total
transfers. Thus, if there are P perimeter points, a maximum
of (N - 1)P/N transfers will be needed.

Multiple Object Images

The software that has been described to this point
has treated the contents of the image field as one object.
If there is more than one object in the image field, the
same software can still be used, but the results will be a
composite of the information for the separate objects.
However, it is not exceedingly difficult to separate the
information for the separate objects.

Once the contours of the image have been determined,
the software knows how many separate objects
are in the image. This involves the classification, hole
and area identification and merging, and perimeter determination
steps described above. That is, the number of
contours will equal the number of objects in the image
given that the objects do not overlap and that no object
is inside another (such as a bolt within a wheel rim). The
items can be processed individually by removing the
objects corresponding to the undesired contours and
reprocessing the image. This can be done for each object
in the image. The individual processing involves all the
previous steps, from classification through perimeter
determination and perimeter statistics.

To remove an object from the image, its perimeter points
(which are known from the contour) are marked to be removed.
Two passes are made over the image (similar to the initial
classification) to convert internal, perimeter, and edge
points bordering the removal points to removal points
themselves. This is similar to the erosion scheme used by
CLIP4 [11]. A final pass is made over the image to convert
all removal points to external points, effectively erasing
the object from the image.
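The removal procedure can be sketched as follows. This is a minimal illustration, not the original software: the label values, the function name `erase_object`, and the use of an eight-neighbor test are assumptions, and perimeter points are treated here as edge points.

```python
# Hypothetical pixel labels; the original encoding may differ.
EXTERNAL, INTERNAL, EDGE, REMOVE = 0, 1, 2, 9


def erase_object(image, perimeter_points):
    """Erase one object in place, given the (row, col) points of its perimeter."""
    rows, cols = len(image), len(image[0])

    # Step 1: mark the known perimeter points for removal.
    for r, c in perimeter_points:
        image[r][c] = REMOVE

    def touches_removal(r, c):
        # True if any of the eight neighbors is already a removal point.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                    if image[rr][cc] == REMOVE:
                        return True
        return False

    # Step 2: one forward and one backward raster pass spread the removal mark
    # to object pixels that border removal points.
    forward = [(r, c) for r in range(rows) for c in range(cols)]
    for order in (forward, reversed(forward)):
        for r, c in order:
            if image[r][c] in (INTERNAL, EDGE) and touches_removal(r, c):
                image[r][c] = REMOVE

    # Step 3: a final pass converts all removal points to external points.
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == REMOVE:
                image[r][c] = EXTERNAL
    return image
```

Because objects are separated by external pixels, the removal mark cannot spread from one object to its neighbor.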

If the program detects multiple objects, it will still
give the composite results, but it will also sequentially
erase  all  but  one  of  the  objects  and  then  process  the 
remaining  object.  This  additional  processing  is  identical 
to  the  main  processing  sequence,  except  the  checks  for 
multiple  objects  are  omitted. 

Additional  Parameters 

Other parameters may be added to a vision system in order to
improve the robustness of object identification. Some of
these additional parameters are simply combinations of
previous parameters. An example of such a parameter is the
factor of roundness (how circular the image is), which is
computed by dividing π times the area by the square of the
perimeter. The area of the object could also be calculated at
the same time that the second classification pass is made.
This area could be combined with the internal hole area to
provide a total of the areas occupied by the object. The
ratio of hole area to total area is similarly obtainable.

There are other parameters that would require additional
computation in the main processing sequence. This class of
parameters would include such features as second moments,
ratios of major and minor axes, finding the bounding
rectangle, and line fitting. Others could be added based upon
the specific task at hand.

Finally, one needs to consider the non-ideal cases where
either multiple objects in the image overlap or the objects
are not entirely contained within the borders. Much
information for the latter case can be obtained from
processing the object as usual and then applying statistical
methods to determine possible matches with known objects. The
other case is not as simple: some type of image reduction
would be necessary if it was determined that an object was
not known. Such software could selectively reduce protrusions
of an object until a known object was found.

5.  Analysis

In order to evaluate the use of the parallel architecture for
computer vision, analytical comparisons of the parallel and
serial algorithms were performed and the simulation of the
parallel software was compared to the serial implementation.
An estimation of the computational speedups was derived by an
examination of the structure of the parallel software.
Table 1 summarizes the speedups for the major algorithms.


Table 1

Computational Performance Results

Algorithm Division   Approx. Speedup           Serial Time   Time Proportion
classification       N(I/(I + N - 1))          15.36         0.3531
holes                N/((N - 1)(SPIFAC + 1))   15.76         0.3630
areas                                          (called by holes)
center               N                         1.64          0.0377
perimeter            N                         10.71         0.2462

I = Image Border (I by I image); N = Number of PEs.

SPIFAC = How many times a section of the object in the image
can switch directions in crossing the image (for example, the
letter "Z" would have a SPIFAC of 2). For the images
analyzed, SPIFAC = 6.


The proportions of time required by different sections of the
code were determined by executing a serial version of the
algorithm. The time proportions are used to provide a
weighting of the parallel speedup results. In this way, a
section with low speedup that requires only a small fraction
of the serial processing time will not falsely lower the
overall speedup. Similarly, a section with high speedup that
requires only a small fraction of the serial processing time
will not falsely raise the overall speedup. Using the time
proportions, the total weighted speedup S(N) can be computed:


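$$
\begin{aligned}
S(N) &= \frac{0.3531\,N I}{I + N - 1}
      + \frac{0.3630\,N}{(N - 1)(\mathrm{SPIFAC} + 1)}
      + 0.0377\,N + 0.2462\,N \\
     &= N \left[ \frac{0.3531\,I}{I + N - 1}
      + \frac{0.3630}{(N - 1)(\mathrm{SPIFAC} + 1)}
      + 0.2839 \right]
\end{aligned}
$$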

One measure of the performance of a parallel algorithm is the
efficiency E(N), defined to be the ratio of the speedup to
the number of processors [8]. Table 2 shows the speedup and
efficiency for the case of a 64 by 64 image. For the example,
although the speedup increases with N, the rate of increase
is not proportional to N and the efficiency decreases
gradually with N.
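As a rough numerical check, the weighted speedup and the efficiency can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, the coefficients are the serial time proportions quoted in the text (0.3531, 0.3630, 0.0377, and 0.2462), and I = 64 and SPIFAC = 6 are assumed as defaults.

```python
def weighted_speedup(n, i=64, spifac=6):
    """Weighted speedup S(N) for N PEs (N >= 2) on an I by I image."""
    classification = 0.3531 * n * i / (i + n - 1)
    holes = 0.3630 * n / ((n - 1) * (spifac + 1))
    center_and_perimeter = (0.0377 + 0.2462) * n  # both sections scale as N
    return classification + holes + center_and_perimeter


def efficiency(n, i=64, spifac=6):
    """Efficiency E(N): the ratio of the speedup to the number of PEs."""
    return weighted_speedup(n, i, spifac) / n


for n in (2, 4, 8, 16, 32, 64):
    print(n, round(weighted_speedup(n), 2), round(efficiency(n), 3))
```

For N = 2 this gives a speedup of about 1.37 and an efficiency of about 0.68; the efficiency falls steadily as N grows.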

Table 2

Speedup and Efficiency for I = 64

N     S(N)    E(N)
2     1.37    0.683
4     2.55    0.638
8     4.88    0.610
16    9.17    0.573
32    16.7    0.523
64    29.6    0.463


The experimental results for the major sections of the
software are presented in Table 3. The simulations were
designed to provide a conservative estimate of the speedup;
assumptions about transfer timings and synchronization delays
could only be approximated. The problem of non-determinism in
speedups was handled by using deterministic versions of
non-deterministic routines. Again, these routines were
designed to provide a conservative estimate of the speedup.
No overlap of processing and transfers was assumed, although
in many situations, inter-PE transfers can be performed at
the same time that independent processing is occurring. The
simulation results can therefore be used as a rough indicator
of the speedup obtained by the parallel algorithms. Both the
analytic and experimental results bear out the observation
that the speedup will not grow as N, because the algorithms
in which the largest proportion of time is spent
(classification and hole location) have less than ideal
speedup. (The experimental speedups are somewhat less than
the analytic speedups due to the conservative assumptions
made throughout the simulation and the non-square image
used.) Simulation demonstrated that the major problem with
the parallel implementation is basically of one form: the
number of transfers needed reduces the effectiveness of the
parallelism. This can occur when the amount of information
that is needed to make a proper decision (such as for hole
merging) is large. This problem can manifest itself in
several forms, such as algorithms that are inherently serial
or that require data from the entire image. Such tasks might
better be performed in one PE or in the control unit.


Table 3

Experimental Speedup Results

Algorithm         Serial Time   N=2 Time   N=4 Time   N=2 Speedup   N=4 Speedup
classification    15.36         9.47       6.02       1.62          2.55
holes and areas   15.76         13.47      17.11      1.17          0.92
center            1.64          1.11       0.86       1.48          2.48
perimeter         10.71         5.61       2.76       1.61          3.84
overall           43.50         29.64      26.86      1.47          1.62

To be compatible with the "curses" input method, images were
64 by 23. The image was divided into 64/N by 23 stripes.

6.  Architectural Considerations

A specific type of architecture has been assumed throughout
this simulation and analysis. At this point, this restriction
will be removed and the tasks considered will be examined to
explore a parallel architecture tailored to the
characteristics of the vision task.

By examining the algorithms, it is seen that a given memory
area (the memory assigned to one PE) is not needed by more
than two PEs in a given processing section. If the memory is
dual ported, with one write channel and two read channels,
then the need for transfers can be virtually eliminated. In
such an approach, the memory that was previously the
exclusive responsibility of a specific PE would still be
connected to that PE via the write channel and one of the
read channels. However, the other read channel would be
connected to a memory redirection network that would be
settable by the control unit when a new type of access
pattern is needed. This redirection network could either be
bidirectional or (more practical) two unidirectional
networks, one direction being used to transmit the memory
addresses and the other being used to return the data. The
advantage of using two unidirectional networks is that
information can be flowing in both directions at the same
time without the need for redirection or buffering. This
would allow the memory to be accessed in an interleaved
manner, further improving system performance. When this
scheme is compared with the number of transfers needed in
some of the processing steps (such as in hole merging and
Fourier descriptor preparation), the possible savings are
quite evident.

7.  Summary 

In this paper, analytic and simulation results for the
application of parallel processing to the computer vision
task have been presented. In general, it has been shown that
for moderate numbers of processors, increases in performance
(such as overall speedup) on the order of I for an I by I
image are obtainable. Because of the modular design of the
software developed, it is quite possible to expand the
processing sequence to include other common image processing
techniques. From the analytic and simulation capabilities
described, given specific speed requirements for a particular
vision task and assumptions about processor speed, it will be
possible to determine the number of processors needed to
satisfy the task requirements. This work contributes to the
understanding of the design of parallel systems for image
processing applications.
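One way to picture the sizing step mentioned above is to search for the smallest power-of-two number of PEs whose predicted run time meets a deadline. The sketch below is a hypothetical illustration: it assumes the weighted speedup model from the analysis (time proportions 0.3531, 0.3630, 0.0377, 0.2462), and it assumes that N cannot exceed I because each PE holds at least one column of the image.

```python
def weighted_speedup(n, i, spifac):
    """Analytic weighted speedup for N PEs on an I by I image."""
    return n * (0.3531 * i / (i + n - 1)
                + 0.3630 / ((n - 1) * (spifac + 1))
                + 0.2839)


def processors_needed(serial_time, deadline, i=64, spifac=6):
    """Smallest power-of-two N (2 <= N <= I) meeting the deadline, else None."""
    n = 2
    while n <= i:
        if serial_time / weighted_speedup(n, i, spifac) <= deadline:
            return n
        n *= 2
    return None
```

With a serial time of 43.50 and a deadline of 10 (in the same time units), the model suggests 8 PEs; a deadline of 1 cannot be met for a 64 by 64 image under these assumptions.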
1.  INTRODUCTION

Parallel  processing  has  the  potential  of  providing  fast,  flexible  solutions  to 
many  computationally  intensive  tasks.  In  this  paper,  the  use  of  parallelism 
for  computer  vision  is  described.  Considerations  for  the  design  of  a  parallel 
architecture  for  computer  vision  are  discussed. 

The  vision  task  consists  of  a  number  of  different  algorithms;  several  of  the 
algorithms  have  markedly  different  computational  characteristics.  It  is 
possible  to  achieve  real-time  implementations  of  some  sequences  of  vision 
algorithms in hardware. The use of parallel processing allows significantly
greater  flexibility,  both  in  the  types  of  images  that  can  be  processed  (e.g., 
gray-level  images  as  well  as  binary)  and  in  the  choice  of  vision  algorithms 
used.  The  work  here  presents  theoretical  analyses  and  simulation  results  for 
a  collection  of  individual  algorithms  and  for  the  overall  vision  task.  This 
paper  extends  the  work  reported  in  Rice  and  Siegel  [1]. 


2.  DEFINITIONS  FOR  THE  PARALLEL  SIMULATION 

In  this  section,  two  general  models  of  parallel  computation  are  defined,  and 
the  specific  model  used  for  the  computer  vision  task  is  presented.  The 
implementation  of  the  parallel  simulation  is  described. 
2.1  Model

Single instruction stream-multiple data stream (SIMD) machines
[2] represent a form of synchronous, highly parallel processing. Systems with
up to 1 000 full processors have been proposed [3], [4]; systems with as many
as 9 000 and 16 000 simple processors have been built [5], [6]. An SIMD
machine typically consists of a control unit, a set of P processing elements (PEs),
each a processor with its own memory, and an interconnection network. The
control unit broadcasts instructions to all PEs, and each active PE executes
the instruction on the data in its own memory. The interconnection network
allows data to be transferred among the PEs. SIMD machines are especially
well-suited for exploiting the parallelism inherent in certain tasks performed
on vectors and arrays.

Multiple instruction stream-multiple data stream (MIMD) machines [2]
represent asynchronous parallel processing. MIMD systems with 16 [7] and
50 [8] processors have been built; MIMD systems with as many as 4 000
processors [9] have been proposed. An MIMD machine typically consists of
P processors and M memories, M ≥ P, where each processor can follow an
independent instruction stream. As with SIMD machines, there is a multiple
data stream and an interconnection network. Thus, there are P independent
processors that can communicate among themselves. There may be a
coordinator unit to oversee the activities of the processors.

The parallel machine model assumed for the computer vision task consists
of a set of PEs under the management of a control unit. The number of PEs is
a power of two: N = 2^n. Each of the PEs has a unique address between 0 and
N - 1. In addition, there exists an interconnection network to allow the
simultaneous transfer of data among the PEs. For the computer vision task,
the transfer patterns required will be uniform modulo shifts and cube
interconnection functions. In a uniform modulo shift, PE j transfers data to
PE (j + d) modulo N for all j, 0 ≤ j < N, given a positive or negative
integer distance d. The value of d may vary from one transfer to the next;
however, for a given transfer, all PEs will send their data the same distance
d. The set of cube interconnection functions consists of n = log2 N functions,
cube_i, for 0 ≤ i < n [10]. If p_{n-1} ... p_i ... p_0 is the binary
representation of a PE's address, then the cube_i function exchanges data
between all pairs of PEs whose addresses differ in bit i:

$$
\mathrm{cube}_i(p_{n-1} \cdots p_i \cdots p_0) = p_{n-1} \cdots \bar{p}_i \cdots p_0
$$

where $\bar{p}_i$ denotes the complement of bit $p_i$.
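The two transfer patterns can be illustrated with a small sketch. The function names and the list-based representation (one data item per PE, indexed by PE address) are assumptions made for illustration; they are not taken from the simulation software.

```python
def uniform_shift(pe_data, d):
    """Uniform modulo shift: PE j sends its item to PE (j + d) mod N."""
    n = len(pe_data)
    received = [None] * n
    for j in range(n):
        received[(j + d) % n] = pe_data[j]
    return received


def cube(address, i):
    """cube_i: complement bit i of a PE address."""
    return address ^ (1 << i)


def cube_exchange(pe_data, i):
    """Each PE receives the item held by the PE whose address differs in bit i."""
    return [pe_data[cube(j, i)] for j in range(len(pe_data))]
```

For example, with four PEs a shift of distance 1 moves the item of PE 3 to PE 0, and cube_0 swaps the items of PEs 0 and 1 and of PEs 2 and 3.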

The model assumed here combines SIMD and MIMD attributes. Each PE
contains the same code but executes the code on a different subimage.
However, within each PE, the code can run in MIMD mode. This
modification to the basic models allows faster execution on some code than a
pure SIMD model, without incurring the expense of the full flexibility of an
MIMD machine. The gains in speed will occur on the execution of
conditional statements:

where  <  condition  >  do  <  block  1  > 
elsewhere  do  <  block  2  > 

In  SIMD  mode,  those  PEs  satisfying  the  <  condition  >  execute 
<  block  1  >.  Then  the  remaining  PEs  execute  <  block  2  >.  In  the  model 
here,  <  block  1  >  and  <  block  2  >  will  be  executed  concurrently  but  in 
different  sets  of  PEs.  On  the  other  hand,  this  is  not  full  MIMD  mode,  as  it  is 
required that the code in each PE be the same. This aids in enforcing data
coherence,  e.g.,  insuring  that  a  PE  acquires  the  correct  version  of  a  variable 
from  another  PE. 
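The timing difference between the two modes can be sketched with a simple cost model. This is an illustration under stated assumptions: each block is given a fixed execution time, `conditions` holds one boolean per PE (at least one PE), and the function names are hypothetical.

```python
def simd_time(conditions, t_block1, t_block2):
    """SIMD mode: the two blocks are executed one after the other."""
    time = 0
    if any(conditions):        # some PE executes <block 1>
        time += t_block1
    if not all(conditions):    # some PE executes <block 2>
        time += t_block2
    return time


def hybrid_time(conditions, t_block1, t_block2):
    """Hybrid mode: each PE runs its own block; the slowest PE sets the time."""
    return max(t_block1 if c else t_block2 for c in conditions)
```

When the PEs split between the two branches, the SIMD cost is the sum of the block times while the hybrid cost is the larger of the two; when all PEs take the same branch, the two modes cost the same.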

Synchronization  can  take  place  in  one  of  two  ways.  First,  synchronization 
is  required  at  all  data  transfer  points,  because  data  transfers  often  involve  the 
same  variable  for  all  of  the  PEs.  Even  if  the  separate  processors  take  different 
times  to  execute  their  code,  they  will  be  forced  to  synchronize  at  transfers  to 
insure  coherence.  Explicit  synchronization  is  also  possible  by  one  of  the 
simulation  language  constructs  that  requires  that  all  PEs  finish  a  section  of 
code  before  any  can  move  to  the  next  section  of  code. 

The  motivation  for  the  assumed  model  comes  from  two  directions.  First, 
for  many  image  processing  operations,  it  is  natural  to  consider  executing  the 
same  code  on  subimages  of  the  original  image.  Each  subimage  is  a  valid 
image,  and  the  same  types  of  operations  are  needed  on  the  pixels  of  each 
subimage.  Second,  since  the  actual  quantities  of  the  various  operations  that 
will  be  performed  on  each  subimage  may  vary,  asynchronous  operation  may 
allow  higher  PE  utilization. 

This hybrid mode of operation may not be suitable for some algorithms.
The  requirements  for  such  a  mode  to  be  useful  are  (1)  that  the  PEs  contain 
and  execute  the  same  code,  with  possible  differences  based  only  on  the 
evaluation  of  conditional  statements,  and  (2)  that  the  need  to  synchronize  at 
data  transfers  does  not  cancel  the  gains  obtained  by  simultaneous  evaluation 
of  conditionals.  For  the  vision  algorithms  examined  here,  these  requirements 
are  met. 


2.2  Simulation 

There are two major approaches to the development of parallel software.
Either the software can be of a generally descriptive nature to illustrate
the parallelism (or lack thereof) inherent in a task, or the software can
be designed to be compilable and testable, either by parallel execution or
serial simulation. Due to the computational intensity and intricacy of the
computer vision task, the most reliable way to insure correctness is by
testing. This guarantees that typical problem cases are being handled
correctly by testing the software for a variety of images. A set of test images,
some with multiple objects, was used for debugging and for analyzing
computational speedup. Therefore, the software was designed so that it could
be compiled and tested.

Programming  was  done  in  a  modified  version  of  the  C  language  [11].  This 
language  was  chosen  for  the  capabilities  it  provides  for  developing  parallel 
data  structures  and  the  high  degree  to  which  one  can  manipulate  system 
information  (such  as  memory  areas).  The  latter  played  a  large  part  in  the 
simulating  of  parallel  data  transfers.  The  actual  conversion  of  the  serial  C 
language  to  a  parallel  language  was  done  by  means  of  macros  and  support 
subroutines. These features were designed to facilitate the development of
parallel  code  without  requiring  the  user  to  know  the  specific  details  of  the 
serial  implementation.  Thus,  one  can  simply  use  the  macro  file  without 
knowing  its  details  and  can  then  write  parallel  code. 

The  major  points  of  this  implementation  are  as  follows.  A  construct  of  the 
form 

in_pe { codeblock; }

executes  the  enclosed  block  of  code  in  each  of  the  PEs.  The  prefix  “PE.” 
prepended  to  a  variable  indicates  that  the  variable  is  local  to  a  PE.  All  other 
variables  are  assumed  to  be  global  (i.e.,  the  control  unit  has  one  copy  of  the 
variable).  Global  variables  are  used  for  such  operations  as  loop  control  and 
overall  conditional  testing.  There  are  also  versions  of  the  “in_pe”  construct 
that  allow  the  code  to  be  executed  in  a  limited  subset  of  the  PEs.  These 
schemes  use  an  address  mask  [12],  which  is  a  matching  format  that  the  PE 
address must match for execution to occur in that PE.
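Selection of a subset of PEs by an address mask can be sketched as follows. The mask notation used here (one character per address bit, most significant bit first, with "0", "1", or "X" for "don't care") is an assumption made for illustration and may differ from the format defined in [12]; the function names are hypothetical.

```python
def matches(address, mask):
    """True if the PE address agrees with every non-X position of the mask."""
    bits = format(address, "0{}b".format(len(mask)))
    return all(m in ("X", b) for m, b in zip(mask, bits))


def active_pes(mask):
    """Addresses of all PEs (out of 2**len(mask)) selected by the mask."""
    return [a for a in range(2 ** len(mask)) if matches(a, mask)]
```

With eight PEs, the mask "XX0" selects the even-numbered PEs and "1XX" selects PEs 4 through 7.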

Interprocessor  communication  is  accomplished  by  a  transfer  subroutine: 

transfer (destination_address, source_address, offset)

The  transfer  routine  uses  these  addresses  along  with  information  about  the 
size  and  structure  of  the  PE  data  space  to  simulate  the  transfer  by  a 
memory-to-memory  move.  Recursive  transfers  and  broadcasts  (in  which  one 
value is transferred to all of the PEs) are similar. Synchronization is needed at
transfer  points  to  insure  data  coherence. 
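A serial simulation of such a transfer might look like the sketch below. It is not the original routine: the class `PEArray`, the flat memory layout (one contiguous block of words per PE), and the reading of the offset argument as a uniform shift distance are all assumptions made for illustration.

```python
class PEArray:
    """Simulated PE data space: one contiguous block of words per PE."""

    def __init__(self, num_pes, words_per_pe):
        self.num_pes = num_pes
        self.words_per_pe = words_per_pe
        self.memory = [0] * (num_pes * words_per_pe)

    def _index(self, pe, local_address):
        return pe * self.words_per_pe + local_address

    def read(self, pe, local_address):
        return self.memory[self._index(pe, local_address)]

    def write(self, pe, local_address, value):
        self.memory[self._index(pe, local_address)] = value

    def transfer(self, destination_address, source_address, offset):
        """Copy the source word of every PE j to PE (j + offset) mod N."""
        # Read every source word first so that the serial loop behaves like a
        # simultaneous transfer, even when source and destination coincide.
        outgoing = [self.read(pe, source_address) for pe in range(self.num_pes)]
        for pe, value in enumerate(outgoing):
            target = (pe + offset) % self.num_pes
            self.write(target, destination_address, value)
```

Reading all outgoing words before writing any of them plays the role of the synchronization at transfer points: no PE can observe a partially completed transfer.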

The vision software and simulations were run on a dual-processor VAX
11/780 [13].
In this section, an overview of the computer vision algorithms is provided.
The parameters described are based on the SRI vision module [14] and
Fourier descriptors [15].

A  simple  mechanism  for  entering  an  image  into  the  system  was  desired.  In 
the  method  chosen,  the  user  employs  a  terminal  with  cursor  control  to  draw 
an  image  on  the  screen  and  enter  that  image  into  the  data  memory.  This 
section of the code used a small subsection of the “curses” [16] utilities
available  on  the  test  system.  This  was  later  expanded  to  allow  other  image 
formats  to  be  input.  The  images  used  here  and  in  the  subsequent  steps  are 
assumed  to  be  binary  images,  although  the  algorithms  can  be  generalized  to 
handle  gray-level  images. 

After  an  image  has  been  entered  into  the  data  memory,  the  first  task  is  to 
classify  the  image.  This  consists  of  transforming  an  image  comprised  of  edge 
and  non-edge  pixels  into  an  image  with  edge,  internal,  and  external  pixels. 
An  internal  pixel  is  a  pixel  that  represents  a  point  on  an  object,  whereas  an 
external  pixel  represents  a  point  external  to  an  object  (such  as  the  external 
background  or  a  hole  in  the  object). 

After  the  inside  and  the  outside  of  the  image  have  been  identified  by  the 
classification  step,  the  holes  in  the  image  are  located.  A  hole  is  defined  as  an 
area  outside  the  object.  Thus,  the  background  also  fits  the  definition  of  a 
hole.  These  holes  are  identified  so  that  later  merging  can  be  accomplished 
easily.  This  capability  is  needed  because  holes  that  are  initially  thought  to  be 
separate  may  actually  be  joined. 

The  areas  of  the  holes  are  computed  and  recorded  at  the  same  time  as  the 
original  hole  identification,  because  the  data  search  patterns  are  similar.  For 
purposes  of  isolating  the  object  parameters,  the  background  is  defined  to 
have  an  area  of  zero. 

Once  the  inside  of  the  object  is  known,  the  center  of  mass  of  the  object  is 
determined.  Although  in  and  of  itself  the  center  of  mass  is  not  a  particularly 
useful  parameter,  it  is  used  to  normalize  some  of  the  perimeter  statistics  to  be 
derived  later. 

To  find  the  perimeter,  the  edge  points  that  are  adjacent  to  the  background 
are  identified.  Once  this  has  been  done,  it  is  a  simple  matter  to  find  the 
distances  from  the  perimeter  points  to  the  center  of  mass.  These  distances  are 
used  to  calculate  the  average,  minimum,  and  maximum  perimeter  distance 
from  the  center  of  mass. 
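The center of mass and the perimeter distance statistics can be sketched as below. This is a minimal illustration, assuming that points are given as (row, column) pairs, that every object pixel has unit mass, and that Euclidean distance is used; the function names are hypothetical.

```python
import math


def center_of_mass(object_points):
    """Mean row and column of the object pixels (unit mass per pixel)."""
    count = len(object_points)
    row_sum = sum(r for r, _ in object_points)
    col_sum = sum(c for _, c in object_points)
    return row_sum / count, col_sum / count


def perimeter_distance_stats(perimeter_points, center):
    """Average, minimum, and maximum distance from perimeter to center."""
    cr, cc = center
    distances = [math.hypot(r - cr, c - cc) for r, c in perimeter_points]
    return sum(distances) / len(distances), min(distances), max(distances)
```

For a 3 by 3 square of object pixels, the center of mass is the middle pixel, the minimum perimeter distance is 1, and the maximum is the square root of 2.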

Finally, using the already determined perimeter, a description of this
perimeter is produced in the form of a list of coordinate pairs. This list can
then be used to determine Fourier descriptors or other similar parameters.
Provisions have been made for the processing of images that contain multiple
(nonoverlapping) objects.

4.  DETAILED  DESCRIPTION  OF  THE  PARALLEL  SOFTWARE 

In  this  section,  details  of  the  vision  algorithms  and  of  their  parallel 
implementation  are  presented.  Results  of  the  simulation  of  the  parallel 
algorithms and analysis of the performance of the parallel vision system are
presented  in  Section  5. 

4.1  Image  initialization 

To  be  able  to  test  the  system  easily,  a  simple  method  by  which  a  user  could 
enter  an  image  into  the  system  was  developed.  The  user  executes  the  vision 
program  and  then  uses  a  standard  keyboard  to  direct  the  cursor  and  draw  an 
image border. The user also has the option of turning the cursor on and off to
allow  the  drawing  of  unconnected  borders  (such  as  an  internal  border).  The 
connection  pattern  for  the  drawing  is  an  eight-neighbor  scheme.  That  is, 
from  a  given  point,  the  user  can  direct  the  cursor  in  any  of  the  four  horizontal 
and  vertical  directions  as  well  as  along  the  diagonals  between  these  directions. 

The  screen  size  does  not  limit  the  size  of  the  image  being  created,  as  the 
screen  merely  acts  as  a  window  into  the  image.  During  image  creation  the 
current  position  of  the  cursor  is  maintained  in  the  upper  left-hand  corner  of 
the  screen.  Messages  and  inputs  are  handled  on  the  lowest  line  of  the  screen. 
If  the  drawing  gets  too  near  to  any  of  the  borders,  the  window  into  the  image 
is  automatically  moved.  The  user  can  also  specify  a  location  to  which  to  move 
the  cursor.  If  this  position  is  not  in  the  current  window,  the  window  is 
automatically moved. All borders are strictly enforced: The user cannot draw
beyond  the  edge  of  the  border  under  any  condition.  After  the  user  has  created 
the  image,  an  exit  command  automatically  starts  the  image  processing  on  the 
image. 

In  addition,  images  with  256  gray  levels  that  are  stored  as  character  arrays 
(e.g.,  one  character  per  pixel)  can  be  loaded  by  the  system.  Simple 
thresholding routines as well as a Sobel operator are automatically applied to
such  images  to  convert  them  into  binary  images.  The  user  is  prompted  for  the 
thresholds  for  each  image. 

The  produced  image  can  be  saved  for  later  testing  and  can  be  reloaded  and 
modified.  The  user  also  has  the  option  of  saving  the  results  in  a  text  file  or  of 
viewing  the  results  as  they  are  produced. 
For the parallel implementation, once the image has been created, it is
divided among the PEs with each of the PEs having an equally dimensioned
stripe (either horizontal or vertical) of the image. Subsequently, each PE
operates on the section of the image contained in its local memory,
communicating with other PEs when further information is needed.
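The division of the image into stripes can be sketched as follows. The function name and the list-of-rows image representation are assumptions made for illustration; the sketch simply requires that the image dimension divide evenly by the number of PEs.

```python
def split_into_stripes(image, num_pes, vertical=True):
    """Return one equally dimensioned stripe of the image per PE."""
    rows, cols = len(image), len(image[0])
    if vertical:
        if cols % num_pes:
            raise ValueError("image width must be a multiple of num_pes")
        width = cols // num_pes
        # PE p receives columns p*width .. (p+1)*width - 1 of every row.
        return [[row[p * width:(p + 1) * width] for row in image]
                for p in range(num_pes)]
    if rows % num_pes:
        raise ValueError("image height must be a multiple of num_pes")
    height = rows // num_pes
    # PE p receives rows p*height .. (p+1)*height - 1.
    return [image[p * height:(p + 1) * height] for p in range(num_pes)]
```

With vertical stripes, PE p and PE p + 1 share a column boundary, so the border exchanges between them are shifts of distance one.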

4.2  Internal/external  classification 

The internal/external classification labels each pixel as being on the inside
of  the  object,  outside  the  object,  or  on  the  border.  The  classification  scheme 
implemented  is  a  two-pass  method.  The  first  pass  traverses  the  image  from 
the  upper  left  to  the  lower  right.  The  initial  classification  of  a  pixel  is  based  on 
the  two  neighboring  points  (to  the  left  of  the  current  point  and  above  the 
current  point)  that  have  already  been  classified.  The  method  tries  to  classify 
the  new  point  as  external  if  either  of  the  previous  points  is  external.  If  the 
adjacent  points  are  both  edges  (border  pixels),  then  information  about  the 
length  of  the  edge  and  the  previous  region  classifications  are  used  to  make  the 
classification. 

The  second  pass  traverses  the  image  from  the  lower  right  to  the  upper  left 
(backward,  as  compared  with  the  forward  pass).  This  pass  uses  the  four 
major  compass  points  in  relation  to  the  current  point  to  attempt  to  correct 
any  classification  errors.  Again,  the  bias  is  toward  external  classification. 

This  section  of  the  vision  software  uses  several  schemes  to  insure  robust¬ 
ness.  Besides  the  ability  to  reclassify  points  on  the  second  pass,  the  software 
also  looks  for  the  specific  case  of  tracing  an  edge.  In  addition,  several  trouble 
patterns are checked to prevent major misclassifications. Figure 1 illustrates
the  classification  procedure.  Figure  1(a)  is  the  image  before  classification 
(border  only).  The  edges  are  represented  by  ‘2.’  Figures  1(b)  and  1(c)  are 
the  image  after  the  first  and  second  passes  of  the  classification,  respectively. 
Internal  points  are  represented  by  ‘1,’  and  external  points  are  represented  by 
‘0.’  An  example  of  a  reclassification  on  the  second  pass  is  illustrated  by  the 
outlined  areas  in  Fig.  1(b)  and  1(c). 

In  the  parallel  implementation,  each  PE  works  with  its  own  stripe  of  the 
image  data.  The  communication  between  PEs  is  limited  to  the  values  of  the 
border  elements  of  a  subimage.  One  such  transfer  takes  place  for  each  border 
element  on  one  of  the  sides  of  the  subimage.  These  transfers  are  uniform 
modulo  shifts  of  distance  one.  As  the  results  show  later,  this  section  of  the 
software  demonstrates  good  speedup.  Thus,  the  assumption  of  a  two-pass 
classifier  gives  a  conservative  speedup  estimation:  if  more  passes  were  used, 
each  pass  would  exhibit  the  same  good  speedup. 
hold identifiers. Experimentation showed that no accuracy problems were
introduced  by  the  small  number  of  neighbors  used  in  the  classification. 

These  operations  are  performed  totally  within  a  PE:  no  communication 
with  other  PEs  is  needed.  Each  PE  owns  the  information  about  its  own  holes. 
This  information  is  transferred  to  other  PEs  during  hole  merging  (described 
later).  Figure  2  shows  the  internal  hole  identifiers  for  each  PE.  Hole 
identifiers that are adjacent (e.g., labels 3, 4, 5, and 6 in PE 2) are considered
common. That is, only one of the identifiers contains the information for the
hole. All of the others contain a pointer to the “master” information.
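The "master" pointer arrangement resembles a simple union-find table, and can be sketched that way. This is an assumed representation for illustration only: the class `HoleTable`, its method names, and the rule that the lower-numbered identifier becomes the master are hypothetical.

```python
class HoleTable:
    """Hole identifiers within one PE; common identifiers share one master."""

    def __init__(self):
        self.master = []  # master[i] == i means identifier i holds the data
        self.area = []    # area is kept only at the master identifier

    def new_hole(self, area=0):
        self.master.append(len(self.master))
        self.area.append(area)
        return len(self.master) - 1

    def find(self, ident):
        """Follow pointers until the identifier holding the data is reached."""
        while self.master[ident] != ident:
            ident = self.master[ident]
        return ident

    def join(self, a, b):
        """Mark two adjacent identifiers as common and merge their areas."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        keep, drop = min(ra, rb), max(ra, rb)
        self.master[drop] = keep
        self.area[keep] += self.area[drop]
        self.area[drop] = 0
        return keep
```

Joining identifiers that are found to be adjacent leaves a single master carrying the combined area, which is the record that would later be placed on the transfer stack during merging across PEs.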

Once  the  holes  have  been  identified  in  each  PE,  they  are  merged  across  the 
PE  borders.  This  is  done  by  transferring  the  borders  of  the  PE  hole  template 
to  adjacent  processors  and  searching  for  matching  holes.  The  areas  are 
merged  at  the  same  time  that  holes  are  joined.  In  the  scheme  used,  if  a  hole 
has  only  one  edge  on  a  PE  border,  then  the  statistics  for  that  hole  are 
transferred  to  that  adjacent  PE.  This  results  in  each  hole  being  “controlled” 
by  one  PE.  The  information  that  needs  to  be  transferred  from  each  PE  is 
placed  on  a  transfer  stack.  These  stacks  are  then  transferred.  All  of  these  are 
transfers  to  logically  neighboring  PEs  (uniform  modulo  shifts  of  a  distance  of 
one). The amount of information transferred is highly dependent on the
actual image. For purposes of easy identification and for separation of holes
within an object from the background, the border background is defined as
having an area of zero. The process of merging is illustrated in Fig. 3.

Fig. 2 Image hole determination

This  method  of  merging  holes  across  PEs  is  deterministic  in  that  the 
maximum  number  of  passes  needed  can  be  determined  by  the  types  of  images 
being  examined.  For  example,  the  more  an  object  tends  to  spiral  (a  spring, 
for example, as compared with a wheel), the more passes are needed. To
analyze performance, preliminary tests assumed a fixed number of passes
(more than necessary for the images considered). In simulation, it was found
that this section provides poor speedup. Thus, for this step, the net result of
the fixed large number of passes is again a conservative estimate of the
computational speedup of the algorithm. A refinement of the algorithm was
also tested. By using only the required number of passes, appreciable
improvements in speedup were obtained.

Fig. 3 Hole merging example


4.4  Computing  image  hole  areas 

The areas to be computed are tabulated at the same time as the hole
identifiers  are  placed  in  the  template  in  each  PE.  The  area  computation  is 
therefore  divided  among  the  PEs.  To  handle  the  merging  of  holes,  either 
within  a  PE  or  between  PEs,  an  indirection  table  that  points  to  the  actual  hole 
area  is  used. 
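The indirection scheme described above can be sketched as follows. This is a minimal illustration, not the original implementation: the class name HoleTable and its methods are hypothetical, and the zero-area convention for the border background is not modeled. Each identifier either is a master entry that holds an area or points to another identifier, so merging two holes only redirects one pointer and folds one area into the other.

```python
class HoleTable:
    """Indirection table: every hole identifier leads to one master entry holding the area."""

    def __init__(self):
        self.parent = {}  # identifier -> identifier it points to (itself if master)
        self.area = {}    # master identifier -> accumulated area

    def add(self, ident, area=0):
        # A new identifier starts as its own master.
        self.parent[ident] = ident
        self.area[ident] = area

    def master(self, ident):
        # Follow pointers until reaching an identifier that points to itself.
        while self.parent[ident] != ident:
            ident = self.parent[ident]
        return ident

    def count_pixel(self, ident):
        # Tabulate one more pixel for the hole this identifier belongs to.
        self.area[self.master(ident)] += 1

    def merge(self, a, b):
        ma, mb = self.master(a), self.master(b)
        if ma == mb:
            return ma
        # Keep the smaller identifier as master (an arbitrary but deterministic choice).
        keep, drop = (ma, mb) if ma < mb else (mb, ma)
        self.area[keep] += self.area.pop(drop)
        self.parent[drop] = keep
        return keep

    def area_of(self, ident):
        return self.area[self.master(ident)]


table = HoleTable()
for ident, area in [(3, 4), (4, 2), (5, 1), (6, 3)]:
    table.add(ident, area)
table.merge(3, 4)
table.merge(5, 6)
table.merge(4, 6)
```

After the three merges, every identifier from 3 to 6 resolves to the same master, and the area is stored only once.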


4.5 Locating the center of mass

After  the  points  that  comprise  an  object  are  known,  the  center  of  mass  of 
the object can be easily determined. In this system this step is performed
by computing the moments in each PE separately and then summing across
PEs using recursive doubling [17] (Fig. 4). The transfers used are the cube_i
functions, 0 ≤ i < log2 N. This scheme requires that each PE know its
absolute  position  in  the  configuration  because  the  weighting  of  one  of  the 
moments  in  each  PE  is  dependent  upon  the  PE  address.  For  example,  if  the 
stripes  are  in  the  vertical  direction,  then  the  x  axis  is  split  among  the  PEs. 
Moments  that  involve  the  absolute  distance  along  the  x  axis  depend  on  the  PE 
address. To obtain the center of mass, log2 N sets of transfers are needed.
After  the  center  of  mass  has  been  determined,  it  is  broadcast  to  all  PEs, 
because  this  information  is  needed  at  a  local  PE  level  in  later  processing. 
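The moment summation can be simulated serially as a sketch. The function names, the vertical-stripe layout, and the choice to accumulate the result in PE 0 are assumptions made for illustration; the source states only that cube functions are used for log2 N sets of transfers and that the result is then broadcast. The x moment shows why each PE must know its address: a local column index is offset by the PE address times the stripe width.

```python
def cube(i, addr):
    """cube_i interconnection function: complement bit i of the PE address."""
    return addr ^ (1 << i)


def reduce_to_pe0(per_pe):
    """Sum one tuple per PE into PE 0 using log2(N) sets of cube transfers."""
    n = len(per_pe)
    steps = n.bit_length() - 1
    if n < 1 or (1 << steps) != n:
        raise ValueError("number of PEs must be a power of two")
    vals = [tuple(v) for v in per_pe]
    for i in range(steps):
        # Receivers are the PEs whose address bits 0..i are all zero.
        for p in range(0, n, 1 << (i + 1)):
            q = cube(i, p)
            vals[p] = tuple(a + b for a, b in zip(vals[p], vals[q]))
    return vals[0], steps


def local_moments(stripe, addr):
    """Zeroth and first moments of the object pixels held by one PE."""
    width = len(stripe[0])
    m00 = m10 = m01 = 0
    for y, row in enumerate(stripe):
        for local_x, value in enumerate(row):
            if value:
                x = addr * width + local_x  # absolute x depends on the PE address
                m00 += 1
                m10 += x
                m01 += y
    return (m00, m10, m01)


def center_of_mass(image, n_pes):
    """Split the image into vertical stripes, one per PE, and combine the moments."""
    width = len(image[0]) // n_pes
    stripes = [[row[a * width:(a + 1) * width] for row in image] for a in range(n_pes)]
    partial = [local_moments(s, a) for a, s in enumerate(stripes)]
    (m00, m10, m01), steps = reduce_to_pe0(partial)
    return (m10 / m00, m01 / m00), steps


image = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
center, steps = center_of_mass(image, 4)
```

In this example the 2 x 2 object straddles the border between PE 1 and PE 2, so neither PE could compute the center alone.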


4.6 Perimeter identification and perimeter
statistics  determination 

Identifying  the  perimeter  is  straightforward  once  the  external  background 
hole  has  been  identified.  This  hole  has  area  zero  by  definition.  An  edge  point 
next  to  an  external  hole  (or  next  to  another  perimeter  point)  is  a  perimeter 
point.  Since  the  area  of  holes  is  determined  through  an  indirection  table,  all 
one needs to do is see if the hole has zero area. When a perimeter point is
located in a PE, a counter in that PE is also incremented so that the total
perimeter can be determined by a simple application of recursive doubling to
accumulate the total across the PEs.

Fig. 4 Example of summing across PEs using recursive doubling

After  the  perimeter  has  been  identified,  it  is  a  simple  matter  to  find  the 
distances  between  the  perimeter  points  and  the  previously  determined  center 
of  mass.  This  is  done  by  scanning  through  the  image  template  looking  for 
perimeter  points.  Each  PE  scans  its  stripe  of  the  image.  For  each  perimeter 
point found, the radial distance from the perimeter point to the center of
mass is determined. A running sum is kept of these distances, along with the
minimum and the maximum distances. When the entire image has been
scanned, recursive doubling is used to find the average, minimum, and
maximum distances. Three stages of recursive doubling transfers are needed,
one set for each of the perimeter statistics being gathered. This results in a
total of 3 log2 N transfers.

Fig. 5 Object perimeter determination and center-of-mass statistics. Total object
perimeter is 109; the center of mass, (33, 11); distances from center of mass to
perimeter, min 3, max 24, average 12

Fig. 6 Example of vision software output. Two holes in image; total perimeter is 109;
total hole area 7; center of mass (33, 11); distances from center of mass to perimeter,
min 3, max 24, average 12; one object in image

Figure  5  shows  the  identified  perimeter  for  an  image.  The  perimeter 
(border) is noted by “B,” as compared with “2” for a nonperimeter edge
point.  Figure  6  shows  an  example  of  the  overall  output  of  the  vision  software. 
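The gathering of perimeter statistics can be sketched as below. The function names are hypothetical, and for brevity the sketch carries the sum, minimum, maximum, and point count together in one record, whereas the text counts one set of transfers per statistic; the returned transfer count follows the text's accounting of 3 log2 N. The point of the example is that each statistic needs its own combining operator (addition, min, or max).

```python
import math


def pe_perimeter_stats(points, center):
    """Per-PE running sum, minimum, and maximum of radial distances to the center of mass."""
    cx, cy = center
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    return {
        "count": len(dists),
        "sum": sum(dists),
        "min": min(dists, default=math.inf),    # a PE with no perimeter points contributes nothing
        "max": max(dists, default=-math.inf),
    }


def combine(a, b):
    """Merge the records of two PEs; each field has its own operator."""
    return {
        "count": a["count"] + b["count"],
        "sum": a["sum"] + b["sum"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }


def gather_stats(per_pe):
    """Combine per-PE records pairwise in log2(N) steps, accumulating in PE 0."""
    n = len(per_pe)
    steps = n.bit_length() - 1
    if n < 1 or (1 << steps) != n:
        raise ValueError("number of PEs must be a power of two")
    vals = list(per_pe)
    for i in range(steps):
        for p in range(0, n, 1 << (i + 1)):
            vals[p] = combine(vals[p], vals[p + (1 << i)])
    total = vals[0]
    average = total["sum"] / total["count"]
    # One set of transfers per statistic, as counted in the text.
    return average, total["min"], total["max"], 3 * steps


center = (0, 0)
per_pe = [
    pe_perimeter_stats([(3, 4)], center),
    pe_perimeter_stats([(0, 1)], center),
    pe_perimeter_stats([(6, 8)], center),
    pe_perimeter_stats([], center),
]
average, minimum, maximum, transfers = gather_stats(per_pe)
```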

4.7  Data  preparation  for  Fourier  descriptors 

As  an  illustration  of  some  of  the  higher  level  functions  that  can  be 
performed  once  the  basic  parameters  have  been  extracted,  the  image  can  be 
converted  into  the  information  necessary  to  calculate  Fourier  descriptors 
[15].  This  information  is  simply  an  ordered  list  representation  of  the 
perimeter  of  the  object.  Each  entry  in  this  list  consists  of  a  set  of  coordinates 
representing  a  perimeter  point.  Fourier  descriptors  have  been  proposed  as  a 
method  of  performing  shape  analysis. 

The  vision  software  begins  this  step  by  forming  the  perimeter  nodes  into  a 
multiply  linked  list,  which  facilitates  the  removal  of  false  perimeter  points 
(spikes).  This  converts  the  perimeter  into  a  traceable  contour.  Next,  these 
linked  lists  are  transferred  to  one  PE  which  completes  the  processing.  This 
requires uniform modulo shifts of distances from 1 to N - 1. This
processing  includes  converting  the  lists  into  partial  ordered  lists  and  then 
combining  these  lists.  Other  schemes,  such  as  forming  the  partial  lists  in  each 
PE  separately,  were  found  to  induce  such  a  large  amount  of  overhead  in 
transfers  that  any  advantages  in  parallelism  were  lost.  The  final  contours  in 
the  single  PE  are  then  broadcast  to  the  remainder  of  the  PEs  in  preparation 
for  the  Fourier  descriptor  calculations.  If  the  perimeter  is  equally  distributed 
among the PEs, (N - 1)/N of the partial ordered listings need to be
transferred. Each of the objects in one of these lists contains ten data fields
(two link fields for the linked list and eight neighbor pointers). If the
perimeter is not equally distributed, then the perimeter could be gathered
into the PE with the largest number of perimeter points, and this requires
fewer total transfers. Thus, if there are P perimeter points, a maximum of
(N - 1)P/N transfers are needed.
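The transfer bound can be checked with a short calculation. This is a sketch under the assumption that one "transfer" moves one list entry; the function names are illustrative only. Gathering into the PE that already holds the most perimeter points moves every point except those already there, and the equally distributed case gives the maximum of (N - 1)P/N.

```python
def gather_transfers(points_per_pe):
    """List entries that must move when all lists are gathered into the fullest PE."""
    return sum(points_per_pe) - max(points_per_pe)


def equal_distribution_bound(n_pes, total_points):
    """Upper bound (N - 1)P/N, reached when the perimeter is spread evenly."""
    return (n_pes - 1) * total_points / n_pes


even = gather_transfers([25, 25, 25, 25])
skewed = gather_transfers([70, 10, 10, 10])
bound = equal_distribution_bound(4, 100)
```

With 100 perimeter points on four PEs, the even split needs the full bound, while the skewed split needs fewer moves.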

4.8 Multiple object images

The  software  that  has  been  described  has  treated  the  content  of  the  image 
field  as  one  object.  If  there  is  more  than  one  (nonoverlapping)  object  in  the 
image  field,  the  same  software  can  still  be  used,  but  the  results  will  be  a 
composite of the information for the separate objects. However, it is not
exceedingly  difficult  to  separate  the  information  for  the  separate  objects. 

Once  the  contours  of  the  image  have  been  determined,  the  software  knows 
how  many  separate  objects  are  in  the  image.  This  involves  the  classification, 
hole  and  area  identification  and  merging,  and  perimeter  determination  steps 
described  above.  That  is,  the  number  of  contours  equals  the  number  of 
objects  in  the  image,  given  that  the  objects  do  not  overlap  and  that  no  object 
is  inside  another  (such  as  a  bolt  in  a  wheel  rim).  The  items  can  be  processed 
individually by removing the objects corresponding to the undesired contours
and reprocessing the image. This can be done for each object in the
image. The individual processing involves all the previous steps, from
classification  through  perimeter  determination  and  perimeter  statistics. 

To  remove  an  object  from  the  image,  its  perimeter  points  (known  from  the 
contour)  are  marked  to  be  removed.  Two  passes  are  made  over  the  image 
(similar  to  the  initial  classification)  to  convert  internal,  perimeter,  and  edge 
points  bordering  the  removal  points  to  removal  points  themselves.  This  is 
similar  to  the  erosion  scheme  used  by  CLIP4  [18].  A  final  pass  is  made  over 
the  image  to  convert  all  removal  points  to  external  points,  effectively  erasing 
the  object  from  the  image. 
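The removal procedure can be sketched as follows. Several details are assumptions not given in the text: the single-character point classes, the 8-connected neighborhood, and the choice of one forward and one reverse raster scan as the two passes. For objects with complicated shapes, two scans of this kind may not reach every point, so this is an illustration of the idea and not a complete implementation.

```python
EXTERNAL, INTERNAL, EDGE, PERIMETER, REMOVE = "X", "I", "E", "P", "R"

# 8-connected neighborhood (an assumption; the text does not specify it).
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def touches_removal(grid, r, c):
    """True if any neighbor of (r, c) is already marked for removal."""
    rows, cols = len(grid), len(grid[0])
    for dr, dc in NEIGHBORS:
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] == REMOVE:
            return True
    return False


def remove_object(grid, contour):
    """Erase the object whose perimeter points are listed in contour."""
    g = [list(row) for row in grid]
    for r, c in contour:
        g[r][c] = REMOVE  # mark the known perimeter points
    rows, cols = len(g), len(g[0])
    forward = [(r, c) for r in range(rows) for c in range(cols)]
    # Two passes: a forward raster scan followed by a reverse raster scan.
    for order in (forward, reversed(forward)):
        for r, c in order:
            if g[r][c] in (INTERNAL, EDGE, PERIMETER) and touches_removal(g, r, c):
                g[r][c] = REMOVE
    # Final pass: removal points become external points.
    return ["".join(EXTERNAL if v == REMOVE else v for v in row) for row in g]


grid = [
    "XXXXXXXXX",
    "XPPPXXPPX",
    "XPIPXXPPX",
    "XPPPXXXXX",
    "XXXXXXXXX",
]
contour = [(r, c) for r in range(1, 4) for c in range(1, 4) if grid[r][c] == PERIMETER]
result = remove_object(grid, contour)
```

The left object, including its internal point, is erased, and the right object is untouched because external points separate the two.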

If the program detects multiple objects, it still gives the composite results,
but it also sequentially erases all but one of the objects and then processes the
remaining object. This additional processing is identical to the main processing
sequence, except that the checks for multiple objects are omitted.


4.9  Additional  parameters 

Other parameters can be added to a vision system to improve the robustness
of object identification. Some of these additional parameters are simply
combinations of previous parameters. An example of such a parameter is the
factor of roundness (how circular the image is), which is computed by
dividing 4π times the area by the square of the perimeter. The area of the
object  could  also  be  calculated  at  the  same  time  that  the  second  pass  in  the 
internal/external  classification  step  is  made.  This  area  could  be  combined 
with  the  internal  hole  area  to  provide  a  total  of  the  areas  occupied  by  the 
object.  The  ratio  of  hole  area  to  total  area  is  similarly  obtainable. 
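The derived parameters are simple to compute once area and perimeter are known. The sketch below uses hypothetical function names; it assumes that the total area is the object area plus the internal hole area, as the paragraph suggests.

```python
import math


def roundness(area, perimeter):
    """Factor of roundness: 4*pi*area / perimeter**2 (1.0 for an ideal circle)."""
    return 4.0 * math.pi * area / perimeter ** 2


def hole_ratio(hole_area, object_area):
    """Ratio of internal hole area to the total area occupied by the object."""
    return hole_area / (hole_area + object_area)


circle = roundness(math.pi * 5.0 ** 2, 2.0 * math.pi * 5.0)  # continuous circle, radius 5
square = roundness(4.0, 8.0)                                 # square of side 2
ratio = hole_ratio(7, 93)
```

A continuous circle gives a roundness of 1 and a square gives π/4; on a pixel grid, the measured perimeter depends on how perimeter points are counted, so values for digitized shapes will differ somewhat.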

There  are  other  parameters  that  would  require  additional  computation  in 
the  main  processing  sequence.  This  class  of  parameters  includes  such 
features  as  second  moments,  ratios  of  major  and  minor  axes,  finding  the 
bounding rectangle, and line fitting. Others could be added, based upon the
specific  task  at  hand. 

Finally, one needs to consider the nonideal cases in which multiple objects
in the image overlap or the objects are not entirely contained within the
borders. Much information for the latter case can be obtained from processing
the object as usual and then applying statistical methods to determine
possible  matches  with  known  objects.  The  other  case  is  not  as  simple:  Some 
type  of  image  reduction  is  necessary  if  it  is  determined  that  an  object  is  not 
known.  Such  software  could  selectively  reduce  protrusions  of  an  object  until 
a  known  object  is  identified. 


5.  ANALYSIS 

To  evaluate  the  use  of  the  parallel  architecture  for  computer  vision, 
analytical  comparisons  of  the  parallel  and  serial  algorithms  were  performed, 
and  the  simulation  of  the  parallel  software  was  compared  to  the  serial 
implementation.  An  estimation  of  the  computational  speedups  was  derived 
by  an  examination  of  the  structure  of  the  parallel  software.  Table  I 
summarizes  the  speedups  for  the  major  algorithms.  The  proportions  of  time 
required  by  different  sections  of  the  code  were  determined  by  executing  a 
serial  version  of  the  algorithm.  (The  serial  algorithm  does  not  incur  any 
overhead  for  operations  such  as  transfers  or  processor  disabling.)  The  time 
proportions  are  used  to  provide  a  weighting  of  the  parallel  speedup  results.  In 
this  way,  a  section  with  low  speedup  that  requires  only  a  small  fraction  of  the 
serial  processing  time  does  not  falsely  lower  the  overall  speedup.  Similarly,  a 
section  with  high  speedup  that  requires  only  a  small  fraction  of  the  serial 
processing time does not falsely raise the overall speedup. With the time
proportions, the total weighted speedup S(N) for processing an I x I image
using N PEs can be computed:

\[
\begin{aligned}
S(N) &= \frac{0.3531\,N I}{I + N - 1} + \frac{0.3630\,N}{(N - 1)(\mathrm{SPIFAC} + 1)} + 0.0377\,N + 0.2462\,N \\
     &= N \left[ \frac{0.3531\,I}{I + N - 1} + \frac{0.3630}{(N - 1)(\mathrm{SPIFAC} + 1)} + 0.2839 \right]
\end{aligned}
\]


Table I

Computational performance results

Algorithm division   Approx. speedup           Serial time   Time proportions
Class()              NI/(I + N - 1)            15.36         0.3531
Holes()              N/((N - 1)(SPIFAC + 1))   15.79         0.3630
Areas()              (called by holes)         N/A           N/A
Center()             N                         1.64          0.0377
Pstats()             N                         10.71         0.2462

Note: I = Image border (I x I image); N = Number of PEs; SPIFAC = number of times a
section of the object in the image can switch directions in crossing the image; for example, the letter Z
would have a SPIFAC of 2. For the images analyzed, SPIFAC = 6.


Table 2

Parallel simulation: experimental results (64 x 64 image)^a

                     1 PE      2 PEs             4 PEs             8 PEs
Algorithm            Avg       Avg      Norm     Avg      Norm     Avg      Norm
Class                40.25     42.25    21.13    50.25    12.63    58.67    7.333
Holes and areas      48.75     79.75    39.88    183.0    45.75    642.0    80.25
Center               5.25      6.25     3.125    6.5      1.625    7.0      0.875
Pstats               14.75     15.25    7.625    12.75    3.188    15.33    1.917
Time subtotal        109.0              71.76             63.19             90.38
Partial speedup      1                  1.52              1.72              1.21
Chain (serial)       54.75     23.75    23.75    28       28       27.67    27.67
Chain (parallel)     N/A       76.75    38.38    139.25   34.81    272.33   34.04
Total time           173.75             133.89            126.0             152.09
Overall speedup      1                  1.30              1.38              1.14
Efficiency           1                  0.65              0.345             0.143

^a Times in 1/60th second
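The weighted speedup expression can be evaluated directly. The sketch below reads it as the sum of each time proportion multiplied by the approximate speedup of the corresponding algorithm division in Table I; the function names are illustrative, and the expression is undefined for N = 1 because of the (N - 1) factor in the hole-merging term.

```python
def weighted_speedup(n, image_side, spifac):
    """S(N): time proportions from Table I weighting the approximate speedups."""
    if n < 2:
        raise ValueError("the hole-merging term is undefined for N < 2")
    class_speedup = n * image_side / (image_side + n - 1)
    holes_speedup = n / ((n - 1) * (spifac + 1))
    center_speedup = n
    pstats_speedup = n
    return (0.3531 * class_speedup
            + 0.3630 * holes_speedup
            + 0.0377 * center_speedup
            + 0.2462 * pstats_speedup)


def weighted_speedup_factored(n, image_side, spifac):
    """The same expression with N factored out and the last two terms merged."""
    return n * (0.3531 * image_side / (image_side + n - 1)
                + 0.3630 / ((n - 1) * (spifac + 1))
                + 0.2839)


# Analytic estimates for a 64 x 64 image with SPIFAC = 6.
estimates = {n: weighted_speedup(n, 64, 6) for n in (2, 4, 8)}
```

Evaluating both forms for the same arguments is a quick check that the factored form agrees with the expanded one.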


The  experimental  results  for  the  major  sections  of  the  software  are 
presented  in  Table  2.  The  columns  labeled  Avg  give  the  average  time  the 
serial simulation took for each step of the algorithm. The columns labeled
Norm give the conversions of the average serial times to the average parallel
times. This is the normalized execution time. The Time-subtotal
row indicates how much time the first four component algorithms
(internal/external classification, hole identification assuming a fixed number
of passes, center of mass, and perimeter statistics) required. The speedup that
these partial times indicate is presented in the Partial-speedup row. The final
algorithm step, formation of the chain code representation of the perimeter,
is represented by two rows in the tables, because it has both a serial and a
parallel component. Finally, the Total-time and Overall-speedup rows
indicate the time that the entire processing operation needed and the speedup
reflected by this time.

One  additional  measure  of  the  performance  of  a  parallel  algorithm  is  the 
efficiency E(N), defined to be the ratio of the speedup to the number of
processors [19]. Table 2 also shows the speedup for the case of a 64 x 64
image.  For  the  example,  although  the  speedup  increases  with  N,  the  rate  of 
increase  is  not  proportional  to  N,  and  the  efficiency  decreases  fairly  sharply 
with  N. 
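The efficiency figures follow directly from the definition. As a small check, the sketch below applies E(N) = speedup / N to the overall speedups listed in Table 2; the dictionary name is illustrative.

```python
def efficiency(speedup, n_pes):
    """E(N): ratio of the speedup to the number of processors."""
    return speedup / n_pes


# Overall speedups from Table 2, keyed by number of PEs.
TABLE2_OVERALL_SPEEDUP = {1: 1.0, 2: 1.30, 4: 1.38, 8: 1.14}

efficiencies = {n: efficiency(s, n) for n, s in TABLE2_OVERALL_SPEEDUP.items()}
```

The computed values agree with the Efficiency row of Table 2 to within rounding.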

The  simulations  were  designed  to  provide  a  conservative  estimate  of  the 
speedup;  assumptions  about  transfer  timings  and  synchronization  delays 
were  approximated.  The  problem  of  nondeterminism  in  speedups  was 
handled  by  using  deterministic  versions  of  nondeterministic  routines.  Again, 
these  routines  were  designed  to  provide  a  conservative  estimate  of  the 
speedup.  No  overlap  of  processing  and  transfers  was  assumed,  although  in 
many  situations,  inter-PE  transfers  can  be  performed  at  the  same  time 
that independent processing is occurring. The simulation results can, therefore,
be used as a rough indicator of the speedup obtained by the parallel
algorithms. Both the analytic and experimental results bear out the observation
that the speedup will not grow as N, because the algorithms in which the
largest  proportion  of  time  is  spent  (hole  merging  and  chain  code  formation) 
have  less  than  ideal  speedup.  (The  experimental  speedups  are  somewhat  less 
than  the  analytic  speedups  due  to  the  conservative  assumptions  made 
throughout  the  simulation.)  In  particular,  the  discrepancy  between  the 
theoretical  and  the  experimental  results  is  primarily  in  the  holes  and  areas 
section.  In  this  section,  the  theoretical  results  take  into  account  the  number 
of  times  the  merging  must  be  performed  but  do  not  take  into  account  the 
overhead  incurred  by  the  transfers  required  by  the  merging.  This  overhead 
turns  out  to  be  a  substantial  portion  of  the  algorithm,  to  the  extent  that  it 
destroys  the  effectiveness  of  the  increased  parallelism.  It  appears  that  having 
subimages  less  than  16  pixels  wide  is  counterproductive. 

To  address  the  problems  with  the  hole  merging  algorithm,  a  new  version  of 
this  algorithm  was  constructed  that  performs  only  the  required  number  of 
hole merging steps (thus removing one of the earlier conservative assumptions).
The algorithm is divided into two parts, which correspond to
single-sided  hole  merging  (such  as  was  illustrated  earlier)  and  multiple-edged 
hole  merging  (which  handles  ringlike  holes  such  as  the  background  hole). 
Each  of  these  stages  proceeds  until  the  number  of  holes  merged  in  each  PE  is 
zero.  This  has  the  advantage  of  eliminating  unneeded  overhead  as  well  as 
having  the  capability  of  dealing  with  pathological  cases  that  might  require 
additional  merging  steps. 
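The stopping rule, repeating the merge step until no PE merges anything, can be illustrated with a simplified label-propagation sketch. This is not the two-stage algorithm described above: the data layout, the function name, and the rule of adopting the smaller label across a border are all assumptions chosen to show why the number of passes depends on the image and not on a fixed constant.

```python
def merge_until_stable(labels, border_pairs):
    """Repeat border merging until a pass performs zero merges.

    labels:       dict mapping (pe, local_id) -> current label
    border_pairs: list of ((pe, id), (pe2, id2)) hole pieces touching across a PE border
    """
    labels = dict(labels)
    passes = 0
    while True:
        merged = 0
        snapshot = dict(labels)  # every PE sees its neighbors' labels from before this pass
        for a, b in border_pairs:
            low = min(snapshot[a], snapshot[b])
            for key in (a, b):
                if labels[key] > low:
                    labels[key] = low
                    merged += 1
        passes += 1
        if merged == 0:  # no PE merged anything: stop
            return labels, passes


# A hole that winds from PE 0 to PE 1 to PE 2 and back into PE 1 as a separate piece.
labels = {(0, "a"): 0, (1, "a"): 1, (2, "a"): 2, (1, "b"): 3}
pairs = [((0, "a"), (1, "a")), ((1, "a"), (2, "a")), ((2, "a"), (1, "b"))]
final, passes = merge_until_stable(labels, pairs)

# A hole that crosses a single border needs fewer passes.
simple_final, simple_passes = merge_until_stable(
    {(0, "a"): 0, (1, "a"): 1}, [((0, "a"), (1, "a"))]
)
```

The winding hole needs more passes than the simple one, which mirrors the observation that objects that spiral require more merging steps.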

The  results  for  this  software  with  this  modification  included  are  in  Table 
3.  Note  that  with  this  modification,  for  64  x  64  images,  eight  processors 
still  provide  speedup  gains,  whereas  previously  only  two  or  four  could  be 
used  before  the  results  deteriorated  due  to  the  overhead  of  the  parallelism. 
With  eight  processors,  the  stripes  in  each  PE  are  only  eight  pixels  wide,  so 
the proportion of time spent in overhead to coordinate between PEs is
substantial. Thus, simulation demonstrated that the major problem with the
parallel implementation is basically of one form: The number of transfers
needed reduces the effectiveness of the parallelism. This can occur when the
amount of information that is needed to make a proper decision (such as for
hole merging) is large. This problem can manifest itself in several forms, such
as algorithms that are inherently serial or that require data from the entire
image. Such tasks might better be performed in one PE or in the control unit.


Table 3

Non-deterministic merging parallel simulation results (64 x 64 image)^a

                     1 PE      2 PEs             4 PEs             8 PEs
Algorithm            Avg       Avg      Norm     Avg      Norm     Avg      Norm
Class                40.25     45.25    22.63    49.0     12.25    58.67    7.333
Holes and areas      48.75     83.0     41.5     121.25   30.31    249.0    31.13
Center               5.25      6.5      3.25     6.75     1.688    6.67     0.8334
Pstats               14.75     15.75    7.875    14.0     3.5      15.33    1.917
Time subtotal        109.0              75.25             47.75             41.21
Partial speedup      1                  1.45              2.28              2.64
Chain (serial)       54.75     24.0     24.0     28.0     28.0     28.33    28.33
Chain (parallel)     N/A       81.0     40.5     144.25   36.06    --       34.08
Total time           173.75             139.76            111.81            103.62
Overall speedup      1                  1.24              1.55              1.68
Efficiency           1                  0.62              0.388             0.210

^a Times in 1/60th second


6.  ARCHITECTURAL  CONSIDERATIONS 

A  specific  type  of  architecture  has  been  assumed  throughout  this  simulation 
and  analysis.  At  this  point,  this  restriction  will  be  removed,  and  the  tasks 
considered  will  be  examined  to  explore  a  parallel  architecture  tailored  to  the 
characteristics  of  the  vision  task. 

By examining the algorithms, one can see that a given memory area (the
[...]
processing section. If the memory is dual-ported, with one write channel and
two read channels, then the need for transfers can be virtually eliminated. In
such an approach, the memory that was previously the exclusive responsibility
of a specific PE would still be connected to that PE by the write channel
and one of the read channels. However, the other read channel would be
connected to a memory redirection network that would be settable by the
control unit when a new type of access pattern is needed. This redirection
network  could  be  either  bidirectional  or  (more  practical)  two  unidirectional 
networks,  one  direction  being  used  to  transmit  the  memory  addresses  and  the 
other being used to return the data. The advantage of using two unidirectional
networks is that information can be flowing in both directions at the
same  time  without  the  need  for  redirection  or  buffering.  This  would  allow  the 
memory  to  be  accessed  in  an  interleaved  manner,  further  improving  system 
performance.  When  this  scheme  is  compared  with  the  number  of  transfers 
needed  in  some  of  the  processing  steps  (such  as  in  hole  merging  and  Fourier 
descriptor  preparation),  the  possible  savings  are  evident. 

7.  SUMMARY 

In  this  paper,  analytic  and  simulation  results  for  the  application  of  parallel 
processing  to  the  computer  vision  task  have  been  presented.  Because  of  the 
modular  design  of  the  software  developed,  it  is  possible  to  expand  the 
processing  sequence  to  include  other  common  image  processing  techniques. 
From  the  analytic  and  simulation  capabilities  described,  given  specific  speed 
requirements  for  a  particular  vision  task  and  assumptions  about  processor 
speed,  it  will  be  possible  to  determine  the  number  of  processors  needed  to 
satisfy  the  task  requirements.  This  work  contributes  to  the  understanding  of 
the  design  of  parallel  systems  for  image  processing  applications. 
Abstract

Contour  extraction  is  used  as  an  image  processing 
scenario  to  explore  the  advantages  of  parallelism  and  the 
architectural  requirements  for  a  parallel  computer  system, 
such as PASM. Parallel forms of edge-guided thresholding
and contour tracing algorithms are developed and
analyzed to highlight important aspects of the scenario.
Edge-guided thresholding uses adaptive thresholding to
allow contour extraction where gray level variations
would not allow global thresholding to be effective.
Parallel techniques are shown to eliminate some types of
overhead associated with serial processing, offer the possibility
of improved algorithm capability and accuracy, and
decrease execution time. The implications that the parallel
scenario has for machine architecture are considered.
Various desirable system attributes are established.


1.  Introduction 

Image  processing  has  long  been  an  application 
viewed as suited to parallel processing U). Many individual
image processing algorithms and their formulations
for parallel processing environments have been studied,
such as image coding [17], image correlation [1,25], image
segmentation [6], two-dimensional FFT [18], histogramming
[24], and line segment generation [26]. However, little
work exists in considering a scenario as a whole for
parallel processing. One such scenario is contour extraction.
Contour extraction is a key tool for use in applications
ranging from computer assisted cartography to
industrial inspection.

In the past, edge information has been used to
improve threshold selection [15] in the contour extraction
process. A new scheme for determining threshold values
has been developed by Suciu and Reeves [28]. This
scheme has been incorporated in an image shape analysis
method directed toward classifying small well-defined
regions, such as buildings and airplanes, which has been
investigated by Mitchell, Reeves, and Fu [16]. A processing
scenario (composed of serial algorithms) which produces
interpretation results from digitized imagery using
these methods has been implemented at Purdue University
on a VAX 11/780. In this application, image sizes
are typically 5000-by-5000 pixels (picture elements). The
image is analyzed in 256-by-256 pixel subimages which
are processed independently. To ensure that each object
(which has a maximum dimension of 127 pixels) will be
completely contained within at least one subimage, it is
necessary to overlap the subimages.
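The amount of overlap is not stated in the text, but it can be derived under simple assumptions. The sketch below works along one axis and assumes that an object of extent d pixels occupies d consecutive pixel positions; the function name and the choice to clamp the last subimage to the image border are illustrative. If consecutive subimage origins are at most tile_side - max_object + 1 pixels apart, every object of at most max_object pixels fits entirely inside some subimage.

```python
def tile_origins(image_side, tile_side, max_object):
    """Subimage origins along one axis so that every object fits in at least one subimage."""
    stride = tile_side - max_object + 1
    if stride < 1:
        raise ValueError("objects may be larger than a subimage")
    last = max(image_side - tile_side, 0)  # origin of the subimage flush with the far border
    origins = list(range(0, last, stride))
    origins.append(last)
    return origins


def contained(x, extent, origins, tile_side):
    """True if an object starting at x with the given extent lies inside some subimage."""
    return any(s <= x and x + extent <= s + tile_side for s in origins)


origins = tile_origins(5000, 256, 127)
stride = origins[1] - origins[0]
overlap = 256 - stride
```

For the sizes quoted in the text, this gives a stride of 130 pixels, that is, an overlap of 126 pixels between neighboring subimages along each axis; the same origins would be used for rows and columns.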

The serial method of [16] yields good results, but is
computationally intensive, incurring long execution times.
The time required to complete the processing scenario can
be reduced by exploiting its inherent parallelism. In this
work, a processing scenario composed of parallel algorithms
which allows the problem to be completed with
significantly reduced execution time is considered. In
addition to decreasing the processing time, the parallel
scenario does not place a limit on the maximum size of
an object. Once it has been constructed, requirements
the parallel scenario imposes on the architecture of a
parallel computer system such as PASM [24] are studied.

A parallel computer system model is given in Section
II. In Section III the object shape analysis problem [16] is
defined and the parallel scenario is overviewed. In Sections
IV and V the parallel algorithms which compose the
scenario are presented, and they are evaluated in Section
VI. The implications the scenario has concerning system
architecture are considered in Section VII.

II. SIMD/MIMD Model

An SIMD/MIMD machine (e.g., CAIP [12]) consists
of a control unit, an interconnection network, and N processing
elements (PEs), where each PE is a
processor/memory pair. This is shown in Fig. 1. An
SIMD/MIMD machine can operate in either SIMD (single
instruction stream - multiple data stream) [8] or MIMD
(multiple instruction stream - multiple data stream) [8]
modes and can dynamically switch between them. When
operating in SIMD mode, the control unit broadcasts
instructions to all processors and each active processor
executes the instructions on data in its own memory.
The same instruction is executed simultaneously in all
active processors. The interconnection network provides
interprocessor communication. When operating in MIMD
mode, each processor fetches instructions from its own
memory and executes them on data in its own memory.
In MIMD mode, the control unit may coordinate the
activities of the PEs. A partitionable SIMD/MIMD system
(e.g., PASM [24], TRAC [11,20]) can be dynamically
reconfigured to operate as one or more independent
SIMD/MIMD machines of varying sizes. In this paper
PASM is used as an example parallel computer system.

Fig. 1. Model of an SIMD/MIMD machine.
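The SIMD mode of operation described above can be mimicked with a small serial simulation. This is only an illustrative sketch: the function names, the dictionary used as a PE memory, and the thresholding instruction are assumptions, and the sequential loop stands in for what would be simultaneous execution in all active PEs.

```python
def simd_broadcast(memories, active, instruction):
    """Control unit broadcasts one instruction; each active PE applies it to its own memory."""
    for pe, memory in enumerate(memories):
        if active[pe]:
            instruction(memory)


def threshold_at(level):
    """Build an instruction that thresholds the pixels held in a PE's memory."""
    def instruction(memory):
        memory["out"] = [1 if value >= level else 0 for value in memory["pixels"]]
    return instruction


# Four PEs, each holding its own part of the image data.
memories = [
    {"pixels": [10, 200]},
    {"pixels": [90, 120]},
    {"pixels": [130, 40]},
    {"pixels": [255, 0]},
]
active = [True, True, False, True]  # PE 2 is disabled for this instruction
simd_broadcast(memories, active, threshold_at(100))
```

Every active PE executes the same instruction on different data, and the disabled PE leaves its memory unchanged.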

PASM, a partitionable SIMD/MIMD machine, is a
large-scale dynamically reconfigurable multimicrocomputer
system being designed at Purdue University [23,24].
Image processing and pattern recognition tasks are the
target problem domain for PASM, and the requirements
of these applications are being used to guide design decisions.
PASM is intended to be a flexible research
machine, and it has more capability than is necessary to
cope with the example image processing scenario discussed
in this paper. In particular, PASM's capability to
be partitioned to operate as many independent
SIMD/MIMD machines of varying sizes is not needed for
this scenario.

The rest of this section is a brief overview of PASM
to provide background for the following sections. A
block diagram showing the basic components of PASM is
given in Fig. 2. The System Control Unit is a conventional
machine, such as a PDP-11, and is responsible for
the overall coordination of the activities of the other components
of PASM. The Parallel Computation Unit (PCU)
contains N = 2^n processors, N memory modules, and an
interconnection network. The PCU processors are
microprocessors that perform the SIMD and MIMD computations.
The PCU memory modules are used by the
PCU processors for data storage in SIMD mode and both
data and instruction storage in MIMD mode. PASM is
being designed for N = 1024. An N = 16 prototype
based on Motorola MC68000 processors is planned [13].


Fig. 2. Block diagram overview of PASM.


Fig. 3. PASM Parallel Computation Unit.


The PCU is organized as shown in Fig. 3. A pair of memory units is
used for each PCU memory module so that data can be moved between
one memory unit and secondary storage while the PCU processor
operates on data in the other memory unit (double-buffering). Each
memory unit is of substantial size (e.g., 64K words). A processor and
its associated memory module form a PCU PE. The PCU PEs are
addressed (numbered) from 0 to N-1. The interconnection network
provides a means of communication among the PEs. PASM will use
either an Extra Stage Cube type [2,22] or Augmented Data Manipulator
type [14,21] of multistage network. The Memory Management System
controls the loading and unloading of the PCU memory modules from
the multiple secondary storage devices of the Memory Storage System.

The Micro Controllers (MCs) are a set of microprocessors which act
as the control units for the PEs in SIMD mode and orchestrate the
activities of the PEs in MIMD mode. Control Storage contains the
programs for the MCs.

III. Image Processing Task

A. Problem Definition and Serial Algorithms

The first stage of the shape analysis scenario of [16] is to
identify boundaries of potential objects using edge-guided
thresholding [28]. Edge-guided thresholding (EGT) uses adaptive
thresholding to allow contour extraction where gray level variations
would not allow global thresholding to be effective. The image is
segmented by selecting several gray level thresholds and tracing the
resulting contours. Classification is accomplished by comparing the
contours with prototype object models using either Fourier
descriptors [30] or standard moments [10,20].

An overview of the serial image processing scenario follows
(further details are given later in this section). Segmentation is
simplest when there is little background information, i.e., the
objects of interest cover a significant portion of the image. To
achieve this with a very large image, the image can be divided into
subimages. A subimage size twice the largest dimension of an object
is chosen, and each subimage is processed independently. Subimages
are located so that they overlap neighboring subimages 50 percent in
both the horizontal and vertical directions. This ensures that an
object will be completely contained in at least one block. However,
it is necessary to perform the image processing computations four
times for each pixel. The advantage of this method is that it
eliminates the need to trace contours across subimage boundaries
(simplifying the algorithms) and significantly reduces the amount of
main memory required (subimages are discarded after processing).
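The overlapping layout of the serial scenario can be illustrated with a short Python sketch. The function names and the requirement that the image side be a multiple of half the subimage side are assumptions made for this illustration; they are not taken from the scenario itself.

```python
def overlapping_origins(image_side, sub_side):
    """Top-left corners of sub_side-by-sub_side subimages placed with
    50 percent overlap in both directions (assumes sub_side is even and
    image_side is a multiple of sub_side // 2)."""
    step = sub_side // 2                       # 50 percent overlap
    starts = range(0, image_side - sub_side + 1, step)
    return [(r, c) for r in starts for c in starts]


def coverage_count(image_side, sub_side, row, col):
    """Number of subimages that contain pixel (row, col)."""
    return sum(
        r <= row < r + sub_side and c <= col < c + sub_side
        for r, c in overlapping_origins(image_side, sub_side)
    )


if __name__ == "__main__":
    # An interior pixel is covered by four subimages.
    print(coverage_count(16, 4, 7, 7))
    # A pixel in the outermost half-subimage strip is covered fewer times.
    print(coverage_count(16, 4, 0, 0))
```

Under these assumptions, the factor of four applies to interior pixels; pixels within half a subimage of the image border are covered by fewer subimages.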

Potential  thresholds  for  a  subimage  are  selected 
using  edge-guided  thresholding,  which  selects  thresholds 
based  on  an  edge-matching  criterion.  Using  the  Sobel 
edge  operator  [7],  an  edge  image  is  generated  in  which
gray  levels  indicate  the  magnitude  of  the  gradient.  A 
figure  of  merit  which  indicates  how  well  a  given  thres- 
holded  gray  level  image  matches  edges  in  the  edge  image 
is  then  computed  for  every  possible  threshold.  Using 
thresholds  with  high  figures  of  merit,  a  requantized  ver¬ 
sion  of  the  gray  level  image  is  generated.  A  median  filter 
IflJ  may  then  be  applied  to  remove  isolated  noise  artifacts. 
The  contours  for  all  potential  objects  not  touching  the
subimage  boundary  (i.e.,  completely  contained  within  the 
subimage)  are  extracted  for  further  shape  analysis.  Very 
short  and  very  long  contours  may  not  be  retained  if  they 
represent  objects  outside  the  range  of  interest.  The 
boundary  of  each  object  (contour)  is  stored  as  a  sequence
of  x-y  coordinates. 

B.  Parallel  Scenario 

In  this  section  a  parallel  formulation  of  the  contour 
extraction  scenario  is  presented.  This  parallel  scenario 
will  be  used  as  an  application  example  for  determining 
the  execution  environment  which  must  be  provided  by 
the  architecture  of  an  SIMD/MIMD  parallel  processing 
system  such  as  PASM.  The  specific  context  of  the  con¬ 
tour  extraction  scenario  would  depend  on  the  application. 
The  contour  extraction  scenario  may  be  preceded  by 
image  processing  such  as  rectification.  Subsequent  use  of 
the  extracted  contours  depends  on  the  particular  end 
application.  Highlighting  contours  of  an  image  requires 
essentially  no  further  processing,  while  shape  analysis  and 
classification  may  involve  significant  additional  calcula¬ 
tion  beyond  contour  extraction. 

An M-by-M pixel image is represented by an array of M^2 pixels,
where the value of each pixel is assumed to be an eight-bit unsigned
integer representing one of 256 possible gray levels. To implement
contour extraction on an SIMD/MIMD machine of 1024 PEs, assume that
the PEs are logically configured as a 32-by-32 grid, on which the
M-by-M image is superimposed, i.e., each processor has an
M/32-by-M/32 subimage (see Fig. 4(a)). For M = 5120, each PE stores a
160-by-160 subimage. Each pixel is uniquely addressed by its i-x-y
coordinates, where x and y are the x-y coordinates of the pixel in
the subimage contained in PE i.
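As an illustration of i-x-y addressing, the following Python sketch converts between global pixel coordinates and (i, x, y) triples. The row-major numbering of PEs across the grid and the reading of x as the row and y as the column within the subimage are assumptions of this sketch (the latter is suggested by the worked examples later in the paper); the function names are hypothetical.

```python
def to_ixy(row, col, image_side=5120, grid_side=32):
    """Map global pixel coordinates to an (i, x, y) triple.

    Assumptions (not stated in the text): PEs are numbered in row-major
    order across the grid, x is the row and y the column inside the
    subimage, and image_side is a multiple of grid_side.
    """
    sub_side = image_side // grid_side          # 160 for M = 5120
    pe_row, x = divmod(row, sub_side)
    pe_col, y = divmod(col, sub_side)
    return pe_row * grid_side + pe_col, x, y


def from_ixy(i, x, y, image_side=5120, grid_side=32):
    """Inverse of to_ixy under the same assumptions."""
    sub_side = image_side // grid_side
    pe_row, pe_col = divmod(i, grid_side)
    return pe_row * sub_side + x, pe_col * sub_side + y
```

For example, under these assumptions the last pixel of the image, at global position (5119, 5119), maps to PE 1023 with local coordinates (159, 159).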

Two important parallel algorithms of the contour extraction
scenario are edge-guided thresholding and contour tracing. The
edge-guided thresholding algorithm, discussed in Section IV, is used
to determine a set of optimal thresholds for each subimage. The
contour tracing algorithm, which is considered in Section V, uses the
set of optimal thresholds to segment the image and trace the
contours, generating an i-x-y sequence for each contour.

The parallel algorithms described yield a significant reduction in
execution time because the multiplicity of processors allows all of
the subimages to be processed simultaneously. Since the parallel
contour tracing algorithm is able to trace contours over subimage
borders, it is not necessary to overlap the subimages, and each pixel
is processed only once. The parallel algorithms can result in
improved information extraction since the subimages can be smaller
(assuming a large number of PEs), yielding a better choice of
thresholds within each subimage. In addition, the parallel algorithms
do not require an object to be contained in a single subimage.

Fig. 4. (a) Data allocation for a 5120-by-5120 image using 1024 PEs.
(b) Data transfers needed to apply Sobel edge operator.

The  parallel  scenario  could  be  implemented  on  a 
serial  computer  system  with  virtual  memory  [4].  The 
disadvantage  of  this  approach  is  that  when  a  contour 
spans  more  than  one  subimage,  the  linking  of  partial  con¬ 
tours  residing  in  different  subimages  requires  that  a 
representation  of  the  subimages,  as  well  as  any  contour 
information,  be  accessible.  This  may  result  in  significant 
delay due to paging subimages into primary memory. Paging overhead
does not occur on a parallel system since the entire image is stored
in primary memory. Thus, it is the multiplicity of primary memories
in a parallel system such as PASM (the large primary memory space)
that makes the non-overlapping subimage approach practical.

IV. Edge-Guided Thresholding

The  first  major  procedure  of  the  example  scenario  is 
edge-guided  thresholding  (EGT)  [28],  which  is  used  to
identify  boundaries  of  possible  objects.  Edge-guided 
thresholding  selects  threshold  levels  based  on  an  edge- 
matching  criterion  instead  of  the  classical  technique  of 
image  histogram  local  minimum  values  [10].  Frequently,
EGT  gives  better  results  than  the  histogram  method
because  it  is  able  to  detect  small  regions  not  discernibly 
represented  in  the  histogram  [28]. 

The  EGT  algorithm  operates  on  each  subimage
independently,  and  consists  of  three  major  steps.  First 
an  edge  image  is  generated.  Then  a  figure  of  merit  is 
computed  for  every  possible  threshold.  Finally,  local 
maxima  (peaks)  in  the  figure  of  merit  function  determine 
the  threshold  levels. 

The Sobel edge operator is used to generate the edge image in the
example scenario. SIMD parallelism is the most advantageous form of
parallelism for the Sobel algorithm. This can be shown by analysis of
the operator itself. Let the image I be M-by-M and I(x,y) be a gray
level image pixel, where 0 ≤ x, y ≤ M-1. The Sobel procedure
(ignoring image edge pixels for clarity) is the following.


for x = 1 to M-2 do
  for y = 1 to M-2 do

    sx(x,y) = (1/4) [ (I(x-1,y-1) + 2*I(x-1,y) + I(x-1,y+1))
                    - (I(x+1,y-1) + 2*I(x+1,y) + I(x+1,y+1)) ]

    sy(x,y) = (1/4) [ (I(x-1,y-1) + 2*I(x,y-1) + I(x+1,y-1))
                    - (I(x-1,y+1) + 2*I(x,y+1) + I(x+1,y+1)) ]

    g(x,y) = sqrt( sx(x,y)^2 + sy(x,y)^2 )
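For readers who prefer executable code, the Sobel procedure can be sketched in Python as follows. This is an illustrative serial version for a single subimage, with the 1/4 scaling as read from the procedure; the function name and the choice to leave border pixels at zero are assumptions of the sketch, not part of the PASM implementation.

```python
import math


def sobel_edge_image(I):
    """Gradient magnitude g for the interior pixels of a square image I,
    given as a list of M rows of M gray levels. Border pixels are left
    at 0, mirroring the 'ignoring image edge pixels' simplification."""
    M = len(I)
    g = [[0.0] * M for _ in range(M)]
    for x in range(1, M - 1):
        for y in range(1, M - 1):
            # Difference between the weighted rows x-1 and x+1.
            sx = ((I[x-1][y-1] + 2 * I[x-1][y] + I[x-1][y+1])
                  - (I[x+1][y-1] + 2 * I[x+1][y] + I[x+1][y+1])) / 4
            # Difference between the weighted columns y-1 and y+1.
            sy = ((I[x-1][y-1] + 2 * I[x][y-1] + I[x+1][y-1])
                  - (I[x-1][y+1] + 2 * I[x][y+1] + I[x+1][y+1])) / 4
            g[x][y] = math.hypot(sx, sy)
    return g
```

Every interior pixel executes exactly the same arithmetic, which is the property that makes the operator a natural fit for SIMD execution.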


The value g(x,y) represents the gradient at pixel (x,y), and these
values form the edge image. The M-by-M image in the Sobel operator
definition corresponds to a subimage within a PE for the scenario.

The algorithm is particularly well suited for SIMD parallelism
because all pixels are processed identically. This complete
synchronization aids the PE-to-PE communication necessary when
subimage border pixels within each PE must be processed. In the case
of this algorithm, transmission delays incurred due to PE-to-PE data
transfers can be overlapped with data processing to reduce total
execution time. All PEs will simultaneously request the same border
pixel relative to their subimages. For example, when processing
begins (with the upper left corner subimage pixel) all PEs will
request (from the PE to their upper left) the pixel immediately above
and to the left of their upper left corner pixel (if this pixel is
within the complete image). This transfer of data from upper left
neighbors can occur for all PEs simultaneously. A total of
4*(160 + 1) = 644 parallel transfers are needed for a 5120-by-5120
pixel image, as shown in Fig. 4(b). The candidate interconnection
networks for PASM can support these parallel transfers from any
neighboring PE. The result of the Sobel operator is the edge image.
High edge image pixel values indicate the presence of an edge.

The next step of the EGT algorithm is to compute a figure of merit
value for each possible gray level. The figure of merit is a measure
of how well the edges generated by a given threshold match the edges
detected by the Sobel operator. Specifically, the figure of merit is
determined as follows.

1. The local maximum and minimum pixel values over a 3-by-3 window
   are determined for each gray level image pixel.

2. For each possible threshold value (i.e., all gray levels) the
   center pixel of the 3-by-3 window is tested to see if it is an
   edge point. It is an edge point if the threshold is greater than
   or equal to the local minimum and less than the local maximum.

3. The mean of the edge image pixels corresponding to the gray level
   image pixels found to be edge points at a given threshold is the
   figure of merit for that threshold.

The figure of merit calculation has portions suited to both SIMD and
MIMD parallelism. Steps 1 and 2 can be done efficiently in SIMD mode
since all pixels are processed similarly. Step 3 is executed only on
the gray level image pixels which are edge points. To do this, the
PEs operate in MIMD mode, each sequencing through the edge points in
its subimage. Since the number of such pixels may vary, some PEs may
complete Step 3 before others.

The greater the mean of the edge points in Step 3, the better the
match between threshold-generated boundaries and the edges detected
by the Sobel operator. To avoid the assignment of a high figure of
merit to a small number of noise pixels, a bias can be added to the
denominator when calculating the mean. This has the effect of
lowering the figure of merit if only a small number of pixels are
above the threshold. The gray levels associated with local maxima
(peaks) in the figure of merit function are chosen for image
segmentation. Typically, three to six levels are chosen. The next
step of the scenario is contour tracing.
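The figure of merit computation (Steps 1 through 3, with the optional bias in the denominator) and the selection of peaks can be sketched serially in Python for one subimage. The function names, the skipping of border pixels, the default bias of zero, and the use of strict local maxima as the peak test are assumptions of this sketch; the paper does not specify them.

```python
def figures_of_merit(gray, edge, levels=256, bias=0.0):
    """Figure of merit for every threshold 0..levels-1.

    gray holds integer gray levels and edge the Sobel magnitudes, both
    as equally sized lists of rows. Border pixels are skipped (an
    assumption made here to keep the 3-by-3 window inside the image).
    """
    total = [0.0] * levels      # sum of edge values at edge points
    count = [0] * levels        # number of edge points per threshold
    rows, cols = len(gray), len(gray[0])
    for x in range(1, rows - 1):
        for y in range(1, cols - 1):
            window = [gray[a][b]
                      for a in (x - 1, x, x + 1)
                      for b in (y - 1, y, y + 1)]
            lo, hi = min(window), max(window)      # Step 1
            for t in range(lo, hi):                # Step 2: lo <= t < hi
                total[t] += edge[x][y]             # Step 3 accumulation
                count[t] += 1
    return [total[t] / (count[t] + bias) if count[t] + bias > 0 else 0.0
            for t in range(levels)]


def peak_thresholds(merit):
    """Gray levels at strict local maxima of the figure of merit."""
    return [t for t in range(1, len(merit) - 1)
            if merit[t - 1] < merit[t] > merit[t + 1]]
```

With a positive bias, a threshold supported by only a few edge points receives a visibly lower figure of merit, which is the noise-suppression effect described in the text.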


V.  Contour  Tracing 

In this section an approach to performing contour tracing using
MIMD parallelism is presented. Initially, each PE contains a list of
threshold values, {T_1, T_2, ..., T_t}, for its subimage which have
been selected using edge-guided thresholding. The number of
thresholds for any given PE is denoted by t and can differ for each
PE. The contour tracing algorithm has two phases. In Phase I, the
subimage is segmented within each PE and all local contours (both
closed and partial) are traced and recorded. In Phase II, the partial
contours traced during Phase I are connected.

A contour table is constructed in each PE containing an entry for
every contour, whether partial or closed, which is located in the
subimage associated with that PE. Each contour table entry contains
the following fields: (a) a contour identification number, (b) the
threshold value which generated the contour, (c) the number of pixels
in the contour, (d) a flag indicating if the contour is closed or
partial, (e) a pointer to the array containing the i-x-y sequence of
the contour, (f) a flag indicating whether the partial contour has
been connected (for use in Phase II), (g) the physical address of the
PE which linked the contour, (h) the physical PE address and
identification number denoting the partial contour blocking extension
of the contour, and (i) a locked/unlocked semaphore. Contour table
entries g, h, and i are discussed below. Each PE also contains a
partial contour list. This list has an entry for each partial contour
containing the i-x-y coordinates of its two end points and a pointer
to its contour table entry.
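One possible in-memory layout for a contour table entry and a partial contour list entry is sketched below in Python. The class and attribute names are hypothetical, the pointer of field (e) is replaced by a list held directly in the entry, and field (c) is derived from that list rather than stored; these are simplifications of the sketch, not the paper's data layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]            # (i, x, y)


@dataclass
class ContourEntry:
    contour_id: int                                      # (a)
    threshold: int                                       # (b)
    closed: bool = False                                 # (d)
    pixels: List[Pixel] = field(default_factory=list)    # (e) i-x-y sequence
    connected: bool = False                              # (f) used in Phase II
    linked_by: Optional[int] = None                      # (g) PE that linked it
    blocked_by: Optional[Tuple[int, int]] = None         # (h) (PE, contour id)
    unlocked: bool = True                                # (i) True means 1

    @property
    def length(self) -> int:                             # (c) number of pixels
        return len(self.pixels)


@dataclass
class PartialContourRef:
    """One partial contour list entry: the two end points and a
    reference to the contour table entry."""
    first_end: Pixel
    last_end: Pixel
    entry: ContourEntry


def make_partial_ref(entry: ContourEntry) -> PartialContourRef:
    """Build the partial contour list entry for a traced partial contour."""
    return PartialContourRef(entry.pixels[0], entry.pixels[-1], entry)
```

Keeping the end points in a separate, small list lets a neighboring PE search for a matching end point without reading the full i-x-y sequences.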

In Phase I there is no PE-to-PE communication. Each PE considers
its threshold values T_i, 1 ≤ i ≤ t, independently. Its subimage is
segmented using each threshold level T_i. To create the segmented
image for threshold T_i, pixels in the original image which have a
value greater than or equal to T_i are assigned a value of one, while
those which are less than the threshold are assigned a value of zero.

Contour tracing begins by scanning rows of the segmented image
beginning with the top row. Scanning stops when a pixel with value
one is found which has a zero-valued neighbor to either side. This
pixel is marked as the start point of a new contour, and its i-x-y
coordinates are stored. Consider this pixel as the center pixel of
the 3-by-3 window in Fig. 5. The contour is traced in a
counterclockwise direction generating a sequence of i-x-y
coordinates. Beginning with the neighboring pixel in position five
(see Fig. 5) and incrementing by 1 modulo 8 to determine the next
pixel, the algorithm looks for a pixel which has a value of one. The
algorithm stores the direction, p, of this new pixel and appends its
i-x-y coordinate to the contour sequence. Treat this new pixel as the
center point of the 3-by-3 window in Fig. 5. The algorithm then looks
for the next pixel in the contour beginning with the pixel in
position (p + 5) modulo 8 (to produce a counterclockwise trace).
Tracing continues until the start point or a point of indecision is
reached. If all of the neighbors of a start point are zero, that
pixel is an isolated point and is ignored.

Fig. 5. Naming convention for the neighbors of the center pixel in a
3-by-3 window.
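The counterclockwise trace can be sketched in Python for a single segmented subimage. Fig. 5 is not reproduced here, so the neighbor numbering below (position 0 to the east, increasing counterclockwise as displayed) is an assumption chosen to be consistent with the "start at position five" and "(p + 5) modulo 8" rules; the function name, the local (row, column) coordinates, the status strings, and the length limit are likewise hypothetical.

```python
# Assumed numbering for Fig. 5: 0 = east, increasing counterclockwise.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
           (0, -1), (1, -1), (1, 0), (1, 1)]      # (row, column) steps


def trace_counterclockwise(seg, start, max_len=10_000):
    """Trace from start through one-valued pixels of the segmented
    subimage seg. Returns (pixels, status), where status is 'closed',
    'indecision' (a neighbor outside seg had to be examined),
    'isolated', or 'too_long'."""
    rows, cols = len(seg), len(seg[0])
    pixels, center, first = [start], start, 5     # first search position
    while len(pixels) < max_len:
        for k in range(8):
            p = (first + k) % 8
            r = center[0] + OFFSETS[p][0]
            c = center[1] + OFFSETS[p][1]
            if not (0 <= r < rows and 0 <= c < cols):
                return pixels, "indecision"       # needs a neighboring PE
            if seg[r][c] == 1:
                break
        else:
            return pixels, "isolated"             # no one-valued neighbor
        if (r, c) == start:
            return pixels, "closed"
        pixels.append((r, c))
        center, first = (r, c), (p + 5) % 8
    return pixels, "too_long"
```

The sketch omits the marking of traced pixels and the clockwise retrace from the start point that follows a point of indecision; both are described in the surrounding text.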

A  point  of  indecision  occurs  when  information  from 
an  adjacent  subimage  is  required  to  determine  the  direction
of  the  contour.  When  a  point  of  indecision  is
reached,  it  is  recorded  as  an  end  point,  and  the  algorithm 
returns  to  the  start  point  to  trace  the  contour  in  a  clock¬ 
wise  direction  until  another  point  of  indecision  is  reached. 
When  tracing  in  the  clockwise  direction,  the  new  contour 
pixels  are  inserted  onto  the  front  of  the  i-x-y  sequence. 
Each  pixel  in  the  contour  is  marked  in  the  thresholded 
image  so  that  the  contour  will  not  be  retraced. 

Consider the following contour tracing example based on Fig. 6. A
10-by-20 image is divided into two 10-by-10 subimages; each subimage
is loaded into one of two PEs. The local threshold value T_1 is
applied to the subimage in each PE. Each PE i begins scanning its
respective subimage at pixel (i,0,0), for a one (indicated by a dot)
with a zero on either side. PE 0 locates the edge of a segmented
object at pixel (0,3,3). Pixel (0,3,3) is the start point for the new
contour. PE 0 traces the contour of the object counterclockwise to a
point of indecision at pixel (0,7,9), which is recorded as an end
point. Pixel (0,7,9) is a point of indecision since pixels (1,6,0),
(1,7,0), and (1,8,0) of the subimage in PE 1, which could extend the
contour, are not in the subimage contained by PE 0. PE 0 then traces
the contour in the clockwise direction beginning at pixel (0,3,3),
reaching a point of indecision at pixel (0,3,9). After the clockwise
trace, the first pixel in the i-x-y sequence describing the contour
is (0,3,9). PE 0 resumes scanning at pixel (0,3,4) and finds no other
contours in its subimage. Note that, for example, pixel (0,4,4) is
not a start point for a new contour since it was marked during the
trace of the first contour. Similarly, a partial contour is located
in PE 1 with (1,7,0) as the first pixel in its i-x-y sequence. Once a
PE has scanned the segmented image generated by threshold T_i, it
repeats the process for threshold T_{i+1}. After all threshold values
in a PE have been considered, Phase I is complete.

Fig. 6 legend: start point; counterclockwise trace mark; clockwise
trace mark; end point (counterclockwise); end point (clockwise).

In Phase II, each PE attempts to connect its partial contours to
partial contours which are located in neighboring PEs. There are two
alternatives for determining when a PE can enter Phase II. With the
first, PEs are allowed to start Phase II processing after all have
completed Phase I. With the second, a PE enters Phase II immediately
after completing Phase I. However, it can only attempt to extend
contours into subimages of PEs which are also in Phase II. If all
neighboring PEs are still in Phase I, the PE must wait. The latter
approach may reduce the total scenario execution time since the PE
with the longest Phase I time may well not be the one with the
longest Phase II time. The first alternative requires time equal to
the sum of the longest times in each phase.
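The difference between the two alternatives can be made concrete with a small Python sketch. The per-PE phase times are hypothetical inputs, and the estimate for the second alternative is deliberately optimistic: it ignores any waiting for neighbors that are still in Phase I as well as lock contention, so it is a lower bound rather than a prediction.

```python
def barrier_time(phase1, phase2):
    """First alternative: every PE finishes Phase I before any PE
    starts Phase II, so the longest time in each phase is paid."""
    return max(phase1) + max(phase2)


def overlapped_time_estimate(phase1, phase2):
    """Second alternative, optimistic estimate: each PE starts Phase II
    as soon as its own Phase I ends (waiting for neighbors is ignored)."""
    return max(a + b for a, b in zip(phase1, phase2))


if __name__ == "__main__":
    phase1 = [10, 4, 6]      # hypothetical Phase I times for three PEs
    phase2 = [2, 7, 3]       # hypothetical Phase II times for the same PEs
    print(barrier_time(phase1, phase2), overlapped_time_estimate(phase1, phase2))
```

The two values coincide only when the same PE is slowest in both phases.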

Since multiple PEs can contain portions of the same contour, there
must be a rule to determine which PEs have priority to attempt to
close a contour. The rule is that each PE attempts to extend only
those of its partial contours which have both end points bordering
subimages to the left and/or above. For example, in Fig. 7, partial
contours A, B, C, and D are considered by the PE, while E, F, and G
are not. For each given partial contour (generated by a threshold
T_i), the PE attempts to extend it into the neighboring PE from the
counterclockwise end point (as described below).
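The precedence rule can be written as a small Python predicate. The sketch assumes, following the Fig. 8 discussion later in this section, that an end point with x = 0 borders the subimage above and an end point with y = 0 borders the subimage to the left; it does not check whether a neighboring subimage actually exists at the edge of the full image. The function names are hypothetical.

```python
def borders_left_or_above(end_point):
    """True if an (i, x, y) end point lies on the top row (x == 0) or
    the left column (y == 0) of its subimage."""
    _, x, y = end_point
    return x == 0 or y == 0


def pe_should_extend(first_end, last_end):
    """Precedence rule: a PE extends a partial contour only if both end
    points border subimages to the left and/or above."""
    return borders_left_or_above(first_end) and borders_left_or_above(last_end)
```

Because a shared contour leaves each subimage through some border, this rule limits the number of PEs that compete to close the same contour without requiring any communication.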

In order for a PE to extend a contour, it must be able to access
and modify contour tables which are located in other PEs. As a
result, a mechanism to prevent one PE from using a contour table
entry while another PE is in the process of using that entry must be
provided by the system and used by the contour tracing algorithm. Any
section of code which modifies a contour table entry is a critical
section [5]. The only table entry fields which can be modified by
another PE are the flag which indicates if the partial contour has
been connected and the physical address of the PE which linked the
contour (fields (f) and (g)). While a critical section is being
executed on a given table entry, that entry is locked, so no other
processor can modify it.

Fig. 6. Example of Phase I contour tracing for a 10-by-20 image. The
triple (i,x,y) represents the i-x-y coordinates of a pixel.

Fig. 7. Phase II connection precedence. Partial contours A, B, C,
and D are considered by the PE; E, F, and G are not.

A semaphore is a variable whose value indicates whether or not a
critical section can be entered [5]. There is a semaphore for each
contour table entry which can take on a value of zero or one. Before
a PE enters a critical section (for a given contour), the processor
performs a P-operation [5] on the given contour to determine if it is
unlocked. If the semaphore for the contour table entry is one, the
processor sets the semaphore to zero (locking the contour table entry
so that no other processor can access it) and enters the critical
section, free to modify the contour table entry. When the processor
completes modification of the contour table entry (i.e., the critical
section ends), it performs a V-operation [5] on the semaphore for the
contour, setting the semaphore to one. The contour table entry is
then unlocked. On the other hand, if the semaphore is initially zero,
the processor receives a message indicating that the partial contour
is locked.
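The non-blocking behavior of the P-operation described above (a locked entry is reported to the caller rather than waited on) can be sketched in Python. Here threads stand in for PEs and a `threading.Lock` stands in for whatever atomic test-and-set the hardware would provide; the class and method names are hypothetical.

```python
import threading


class EntrySemaphore:
    """Binary semaphore for one contour table entry:
    1 = unlocked, 0 = locked."""

    def __init__(self):
        self._value = 1
        # Guard that makes the test-and-set atomic in this sketch.
        self._guard = threading.Lock()

    def p(self):
        """Non-blocking P-operation. Returns True if the entry was
        unlocked and is now locked by the caller, False if it was
        already locked (the 'contour is locked' message)."""
        with self._guard:
            if self._value == 1:
                self._value = 0
                return True
            return False

    def v(self):
        """V-operation: set the semaphore to one, unlocking the entry."""
        with self._guard:
            self._value = 1
```

A caller that receives False from `p()` does not wait; it proceeds to whatever recovery action the algorithm prescribes for a locked entry.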

If the end point of a given partial contour is not at a corner of
its subimage, there are three pixels, located in the adjacent
subimage, which can possibly extend the contour. The PE accesses the
partial contour list for the adjacent subimage (see Section VII).
Considering the possible extending pixels one at a time in
counterclockwise order, the PE checks the partial contour list to
determine if any partial contours in the adjacent subimage have the
possible extending pixel as an end point. If such a partial contour
exists, the PE performs a P-operation on the contour table entry
pointed to by the partial contour list. If the contour was unlocked,
the i-x-y sequence for the contour is transferred (discussed in
Section VII) to the PE containing the given partial contour and then
concatenated to its i-x-y sequence, forming a new, extended partial
contour. If there is more than one partial contour with the same end
point which can extend the given contour, the partial contour which
was generated by a threshold value closest to that for the given
contour is selected.
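The closest-threshold selection rule can be sketched as a one-line choice in Python. The representation of candidates as (threshold, contour id) pairs and the tie-breaking behavior (the first candidate listed wins) are assumptions of this sketch; the paper does not say how ties are resolved.

```python
def pick_extension(given_threshold, candidates):
    """candidates: list of (threshold, contour_id) pairs for partial
    contours in the adjacent subimage that share the candidate end
    point. Returns the contour id whose threshold is closest to
    given_threshold, or None if there are no candidates."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs(c[0] - given_threshold))[1]
```

For instance, with a given threshold of 100 and candidates generated by thresholds 80, 110, and 140, the contour generated by threshold 110 is selected.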

If the end point of a given partial contour is a corner point of
its subimage, there are five pixels located in adjacent subimages
which can possibly extend the contour. Since these five pixels are
located in three different subimages, the PE attempting to extend the
given partial contour must check for continuation in each of the
upper-left adjacent subimages (in a counterclockwise order).
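The counts of three and five candidate pixels follow directly from the geometry, as the following Python sketch shows. It enumerates the neighbors of a local pixel that fall outside its subimage and reports, for each, the offset of the adjacent subimage in the PE grid and the pixel's local coordinates there. The function name and return format are hypothetical, x is taken as the row and y as the column, the candidates are not returned in counterclockwise order, and the edge of the full image is not checked.

```python
def extension_candidates(x, y, side):
    """Neighbors of local pixel (x, y) that fall outside a side-by-side
    subimage. Each result is (d_row, d_col, nx, ny): the offset of the
    adjacent subimage in the PE grid and the pixel's coordinates there."""
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < side and 0 <= ny < side:
                continue                      # still inside this subimage
            d_row, nx = divmod(nx, side)      # -1, 0, or +1 subimage rows
            d_col, ny = divmod(ny, side)      # -1, 0, or +1 subimage columns
            found.append((d_row, d_col, nx, ny))
    return found
```

A non-corner border pixel yields three candidates in one adjacent subimage, while a corner pixel yields five candidates spread over three adjacent subimages.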

Note that regardless of where partial contour end points lie, the
search for pixels to extend the contour can be widened beyond the
three or five pixels here to allow for threshold value
discontinuities at subimage boundaries. Thresholds could be
interpolated across subimage boundaries to allow partial contours
with non-adjacent end points to be joined.

Assume that PE i has a partial contour which it is responsible for
extending. If a continuation of the partial contour is not found in
the partial contour list for the adjacent subimage, PE i probes into
the adjacent subimage to determine if an extension of the partial
contour can be generated by the threshold, T, it (PE i) used to trace
its partial contour. If so, PE i extends its partial contour by
accessing the data from the adjacent PE. Instead of creating an
entire segmented subimage for the threshold T, PE i dynamically
thresholds pixels as needed. This contour generation using T is done
since it is possible that the partial contour in the adjacent PE was
not located in Phase I because different threshold values were used,
or the contour fell along the edge of the subimage (see the split
between PEs 2 and 3 in Fig. 9).

Once PE i locates a partial contour in an adjacent subimage which
continues the given contour and has stored the concatenated contour
in its contour table, it repeats the process, if necessary, by
following the contour to the next PE until the contour is closed or
cannot be extended. A limit is placed on the maximum contour length
to guarantee algorithm termination in the event of a pathological
image.

Consider the example in Fig. 8 where a 12-by-12 pixel image is
divided between four PEs. After Phase I, PE 0 contains partial
contours A with end points (0,1,5) and (0,5,1) and B with end points
(0,4,5) and (0,5,3); PE 1 contains partial contour C with end points
(1,4,0) and (1,1,0); and PE 2 contains partial contour D with end
points (2,0,1) and (2,0,3). Since both end points for contour C
border the subimage to the left, PE 1 attempts to extend contour C in
Phase II. Similarly, PE 2 attempts to extend contour D since its end
points border the subimage above.

Fig. 8. Example where two PEs attempt to close the same contour. End
point coordinates are given where (i,x,y) represents the i-x-y
coordinates of the pixel. (Legend: pixels traced in Phase I.)

PE 1 attempts to extend C in the counterclockwise direction, i.e.,
from pixel (1,1,0). It first locks its contour table entry for C. It
then examines the contour table of PE 0 and determines that A can be
linked to C. If the table entry for A is unlocked (i.e., the
semaphore value is one), PE 1 locks it (performs a P-operation) and
appends the i-x-y sequence of A to the i-x-y sequence of C. It also
sets the flag which indicates that A has been linked and records that
PE 1 performed the linkage.

Independently of the actions of PE 1, PE 2 attempts to extend
contour D (from pixel (2,0,3)). As did PE 1 with A, PE 2 appends B to
D. If PE 2 attempts to extend the result, DB, while PE 1 is in the
process of extending C into PE 0, it will find C locked. PE 2 then
abandons its attempt to close the contour, since PE 1 is also
attempting to do it, and unlocks partial contour DB. This allows PE 1
to access DB after it has appended A to C. Therefore, the closed
contour CADB is ultimately traced completely and stored by PE 1. If
PE 1 had completed linking A to C before PE 2 completed linking B to
D, the closed contour would have been completely traced by PE 2.
Deadlock is the situation when each of two or more PEs is halted
while waiting for the other(s) to continue [27]. Deadlock is
prevented by (1) not allowing a PE to wait for access to a locked
contour table entry of another PE, and (2) requiring a PE that is
blocked by a lock to unlock its own affected partial contour.

If PE 1 and PE 2 had completed their first linking operation
simultaneously, both would have abandoned tracing the contour (i.e.,
no PE would link the contour CADB). To ensure that the linking of a
contour will not be abandoned by all PEs, the following protocol is
used. Assume PE i is blocked from extending a contour X by PE j,
which has higher positional precedence (i.e., i < j). In that case,
PE i unlocks contour X and sends a message informing PE j that PE i
has abandoned its attempt to further extend contour X. If PE j had
also abandoned the contour, this message would cause PE j to try
again. The message sent from PE i to PE j contains the identification
number of contour X and the value i. After receiving the message,
PE j searches its contour table to determine if it abandoned X. To do
this it uses field (h) of the contour table. For the above example
PE 2 would link the partial contours since it has higher precedence.

Fig. 9. Results of Phase I of contour tracing for a 30-by-20
subimage.


Deadlock with multiple contours cannot occur since each PE
considers only one contour at a time and does not abandon the attempt
to extend that contour until that PE has closed the contour or has
relinquished control to another PE to close that contour.

When  Phase  II  of  the  algorithm  is  complete,  the  i-x-y 
sequence  for  each  contour  in  the  image  will  be  contained 
in  exactly  one  of  the  PEs  which  contained  part  of  the 
contour originally. In the example given in Fig. 6, PE 1
will contain the i-x-y sequence for the contour.

As a final example, a 30-by-20 image is divided into
six 10-by-10 subimages; each subimage is loaded into one
of six PEs. In Figs. 9 and 10 the results of Phase I and II
processing are shown, respectively. Even though the
entire  object  in  PE  5  was  located  within  the  subimage, 
the  left  edge  of  the  object  was  not  traced  in  Phase  I  since 
PE  5  could  not  determine  whether  the  object  continued 
into  the  next  subimage.  On  the  other  hand,  a  closed  con¬ 
tour  was  found  in  Phase  I  for  the  object  in  PE  4  since  the 
object  did  not  include  any  border  pixels  of  the  subimage. 

VI. Algorithm Evaluation

Subimage  size  for  the  serial  algorithm  is  chosen  to  be 
twice  the  maximum  allowed  object  dimension  so  that 
overlapping  of  subimages  guarantees  that  each  object 
appears in its entirety in some subimage. With this property,
partial contours never need to be considered; all
objects  are  found  as  closed  contours  within  a  subimage. 
Fig. 10. Results of Phase II of contour tracing for a 30-by-20
subimage.
The advantage of the serial approach (Section III.A) over
the parallel approach (Section III.B) is that partial contour
extension is not necessary. The disadvantages of the
serial  approach  when  compared  to  the  parallel  approach 
are  threefold.  First,  the  maximum  size  of  an  object  of 
interest  must  be  established  so  that  subimage  size  is 
known. This choice is constrained by the fact that EGT
performance tends to degrade with increasing subimage
size. Thus, there is a practical limit on the maximum
object size. Second, each pixel is processed for contour
extraction four times. Finally, thresholding (including
EGT) tends to perform less well when objects are small
relative to the image (in this case, subimage) size. The
parallel  algorithms  do  not  limit  maximum  object  size, 
process  each  pixel  just  once,  and  may  improve  threshold 
accuracy  by  allowing  ready  use  of  small  subimages. 
Thus,  parallel  systems  can  allow  the  full  benefits  of  adap¬ 
tive  thresholding  via  EGT  to  be  more  readily  realized. 

Speedup  is  the  usual  rationale  for  employing  parallel 
processing  techniques,  and  the  example  parallel  scenario 
has  the  potential  for  significant  speedup.  However,  the 
speedup  is  data  dependent.  This  is  because  the  PE  work¬ 
load  may  be  highly  varied  during  contour  tracing  due  to 
uneven  distribution  of  contours  throughout  the  image 
being processed. While it may be possible to implement
load  sharing  for  this  portion  of  the  scenario  (with  certain 
overhead  costs),  inequities  reducing  actual  speedup  are 
almost  certain  to  remain. 
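
As a rough illustration of this data dependence, the following Python sketch (not part of the original study) estimates speedup from hypothetical per-PE work counts. It assumes no load sharing and ignores communication overhead, so parallel time is set by the most heavily loaded PE; the function name and the work figures are invented for the example.

```python
def estimated_speedup(work_per_pe):
    """Serial work divided by the work of the busiest PE.

    work_per_pe: hypothetical work units (e.g., contour pixels traced)
    assigned to each PE. Communication and synchronization are ignored.
    """
    busiest = max(work_per_pe)
    if busiest == 0:
        return 1.0  # no work anywhere, so no speedup to speak of
    return sum(work_per_pe) / busiest


even = estimated_speedup([10, 10, 10, 10])   # contours spread evenly
skewed = estimated_speedup([37, 1, 1, 1])    # one PE holds most contours
print(even, skewed)
```

With the same total work on four PEs, the balanced case reaches the ideal factor of four, while the skewed case barely improves on serial execution.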

Overall,  the  parallel  algorithms  presented  are  strong 
contenders  to  replace  serial  methods  in  some  applications. 
One  such  is  quality  control  inspection  of  printed  circuit 
boards.  In  this  application,  large  object  handling  capabil¬ 
ity  is  needed  for  following  long  circuit  traces,  and 
sufficient speedup is necessary for timely response. Other
applications  involve  military  environments  where  real¬ 
time  processing  is  crucial. 

VII. Architectural Implications

The  study  of  a  parallel  formulation  of  an  image  pro¬ 
cessing  scenario  involves  both  the  design  of  individual 
parallel  algorithms  and  the  determination  of  a  method  to 
integrate  them  into  a  single  job.  This  leads  to  an  under¬ 
standing  of  necessary  and  useful  hardware  attributes  for 
a  parallel  machine  intended  to  execute  that  scenario.  For 
the  example  scenario,  aspects  of  each  algorithm  which 
have  an  architectural  impact  other  than  those  pertaining 
to the processors will be listed. Processor specific considerations
(e.g., instruction set) are not treated because
they are similar for serial and parallel machines.

The  Sobel  edge  detection  algorithm  step  of  EGT 
requires  data  that  is,  by  vast  majority,  local  to  each  PE. 
When non-local data is required, nearest neighbor PEs
comprise the set of data sources. Local maxima and
minima calculation on 3-by-3 windows mimics the characteristics
of the Sobel operator, but with more memory
references. Edge point detection is similar in these
regards to the previous steps.

The  figure  of  merit  calculation  for  EGT  is  different 
in  kind  from  the  previous  steps.  Only  local  data  is 
required,  and  processing  time  is  data  dependent.  MIMD 
operation  is  preferable  to  SIMD,  even  if  edge  point  detec¬ 
tion  and  figure  of  merit  calculations  are  merged  into  a 
one-pass  operation. 

Phase I of contour tracing requires only local data,
but execution time is data dependent. Phase II makes
heavy use of non-local data and has data dependent execution
time. Both phases are suited to MIMD mode.

Now  the  architectural  requirements  for  a  parallel 
machine performing the example scenario can be considered.
Probably the most basic need for the system, if it
is to support the scenario well, is to be capable of dynamically
switching between SIMD and MIMD operation, as
can PASM. With only SIMD capability, vast inefficiency
would occur in later stages of the scenario. Having only
MIMD  mode  is  a  less  serious  handicap,  but  will  lengthen 
execution  time  for  the  Sobel  operator  and  determining 
local  maxima  and  minima,  due  to  the  need  for  explicit 
synchronism  and  data  sharing.  Thus,  the  capability  to 
dynamically  switch  between  SIMD  and  MIMD  modes  is 
important  so  that  each  subsequent  portion  of  the  scenario 
can  be  executed  in  the  most  appropriate  operational 
mode. 

An  interconnection  network  is  needed  to  perform 
permutations  involving  eight  nearest  neighbors  in  SIMD 
mode.  In  MIMD  mode,  it  is  used  for  eight  nearest  neigh¬ 
bors  and  for  somewhat  arbitrary  one-to-one  connections 
(when  transferring  partial  contour  information  between 
non-adjacent  PEs).  Both  types  of  connection  needs  must 
be  performed  efficiently  by  the  network.  The  networks 
proposed  for  PASM  can  do  so. 

The PE-to-PE transfer of information must be
efficient, or the parallel algorithms will be slowed. One
method to perform PE-to-PE communication is by using
direct  memory  access  (DMA).  DMA  is  a  method  for  stor¬ 
ing  or  retrieving  data  without  processor  intervention. 
There  are  several  ways  to  implement  this  capability.  In 
one,  a  PE  extending  a  partial  contour  sends  an  interrupt 
to  the  remote  PE  containing  the  extension  of  the  partial 
contour  along  with  the  identifier  of  the  needed  partial 
contour.  The  remote  PE  then  enters  a  DMA  handling 
routine. This routine computes the local memory address
range of the requested partial contour i-x-y sequence and
sends this information along with the requesting PE
number to special DMA hardware. The DMA hardware
then autonomously retrieves the information from local
memory and performs necessary network interfacing to
send the data to the requesting PE. DMA hardware
accesses to local memory can be via cycle stealing.
Another implementation of DMA capability is through an
intelligent network interface unit (NIU). Requests for
data  from  remote  PEs  would  be  received,  interpreted, 
and  discharged  by  the  NIU  without  local  PE  processor 
intervention.  The  NIU  would  combine  DMA  capability 
with  network  protocol  support.  VLSI  technology  may 
allow  ready  fabrication  of  sophisticated  NIUs.  Such  a 
capability  would  be  worthwhile  to  include  in  a  system 
such  as  PASM. 

VIII. Summary

Considering  an  entire  scenario  in  the  light  of  parallel¬ 
ism  is  a  useful  approach  for  matching  image  processing 
tasks  and  parallel  architectures.  A  number  of  observa¬ 
tions  were  made  and  conclusions  drawn  from  the  example 
image  processing  scenario.  In  particular,  the  parallel 
scenario was found to embrace both SIMD and MIMD
subtasks, involve significant PE-to-PE data transfer, and
contain both nearest-neighbor and non-adjacent PE communication
patterns. Parallel formulation of the algorithms
leads to several advantages including speedup,
elimination  of  object  size  constraints,  and  potential  for 
improved  accuracy.  These  observations  indicate  that 
parallel  contour  extraction  could  be  useful  in  industrial 
inspection  and  military  applications.  They  suggest  desir¬ 
able  system  architecture  features,  including  SIMD/MIMD 
capability with dynamic mode switching, dedicated PE-to-PE
communication support hardware, and arbitrary
PE-to-PE interconnection capability. These requirements
are  consistent  with  the  capabilities  of  PASM. 
Chapter Nine

The Use and Design of PASM*

1. INTRODUCTION

Parallel processing has been successfully used to reduce the time of computation
for a wide variety of applications. The processing of large amounts of
data,  the  need  for  real-time  computation,  the  use  of  computationally 
expensive  operations,  and  other  demands  that  would  make  a  task  too 
time-consuming  to  perform  on  conventional  computer  systems  have  forced 
computer  architects  to  consider  parallel/distributed  computer  designs.  Ap¬ 
plications  that  have  one  or  more  of  these  characteristic  demands  include 
image  analysis  for  automated  photo  reconnaissance,  map  generation,  robot 
(machine)  vision,  and  rocket  and  missile  tracking;  digital  signal  processing 
for  speech  understanding  and  biomedical  signal  analysis;  and  vector  process¬ 
ing  for  the  solving  of  large  systems  of  equations.  To  date,  a  variety  of 
special-purpose  machines  has  been  constructed  to  speed  the  processing  of 
select  groups  of  algorithms.  Examples  are  special-purpose  digital  signal 
processors  such  as  the  APS-II  [1],  array  processors  such  as  the  AP-120B 
(Floating  Point  Systems,  Inc.  Portland,  Oregon),  and  supercomputers  with 
vector/pipeline  operations  such  as  the  Cyber  205  [2]. 

Our  goal  is  the  design  of  a  flexible  parallel  processing  system  that  can  be 
dynamically  reconfigured  to  meet  the  particular  processing  needs  of  a  large 
variety  of  applications  in  the  image  and  speech  analysis  domains.  The  system 
Army, under grant number DAAG29-82-K-0101; by the United States Air Force Command,
Rome Air Development Centre, under contract number F30602-83-K-0119; and by the National
Science Foundation under grant ECS-81-20896.


INTEGRATED TECHNOLOGY FOR
PARALLEL IMAGE PROCESSING

Copyright © 1985 by Academic Press, London. All rights of reproduction

J. T. Kuehn, H. J. Siegel, D. L. Tuomenoksa, and G. B. Adams III


being designed is PASM, a partitionable SIMD/MIMD machine. In this
chapter, two algorithms used in parallel contour extraction are given as an
image processing scenario to explore the advantages and implications of using
the PASM parallel processing system and to motivate the inclusion of its
important architectural features. These features will help to identify the
attributes  of  a  custom-designed  VLSI  processor  chip  set  for  PASM.  In 
particular,  the  architectural  features  that  could  be  incorporated  into  a  VLSI 
chip  set  that  will  match  the  needs  of  parallel  algorithms  in  the  image  and 
speech  processing  domains  will  be  explored.  Using  algorithm  characteristics 
to  drive  the  design  of  PASM  will  lead  to  a  machine  that  has  the  necessary 
flexibility  for  executing  image  and  speech  processing  algorithms. 

In  the  next  section,  the  parallel  processing  model  and  an  overview  of  the 
PASM  architecture  are  given.  Section  3  outlines  two  algorithms  of  the 
contour-extraction task. The first algorithm, edge-guided thresholding, is
discussed in Section 4. Section 5 describes the second algorithm, contour
tracing.  The  architectural  implications  of  these  algorithms  are  explored  in 
Section  6. 


2. SIMD/MIMD MODEL

Two types of parallel processing systems are single-instruction stream-
multiple-data stream (SIMD) machines and multiple-instruction stream-
multiple-data stream (MIMD) machines [3]. A SIMD machine typically
consists  of  a  control  unit,  an  interconnection  network,  and  N  processing 
elements  (PEs),  with  each  PE  being  a  processor/memory  pair  (Fig.  1).  The 


Fig. 1 Model of an SIMD/MIMD machine.
Fig. 2 Block diagram overview of PASM.


control  unit  broadcasts  instructions  to  the  processors,  and  all  active  (enabled) 
processors  execute  the  same  instruction  at  the  same  time.  Each  processor 
executes  the  instructions  with  data  taken  from  its  own  memory.  The 
interconnection network allows interprocessor communication. A MIMD
machine has a similar organization, but each processor can follow an
independent instruction stream. As with SIMD architectures, there is a
multiple data stream and an interconnection network. The control unit may
coordinate the activities of the PEs in MIMD mode. A SIMD/MIMD
machine can operate in either mode and dynamically switch between them. A
partitionable SIMD/MIMD system (e.g., PASM [4]; TRAC [5], [6]) can
be dynamically reconfigured to operate as one or more independent
SIMD/MIMD machines of various sizes.

PASM  is  being  designed  using  a  variety  of  applications  problems  from  the 
areas  of  image  and  speech  analysis  to  guide  the  machine  design  choices.  It  is 
not  meant  to  be  a  production-line  machine  but  a  research  tool  for  studying 
large-scale  SIMD  and  MIMD  parallelism. 

A block diagram of the basic components of PASM is given in Fig. 2. The
heart of the system is the parallel computation unit (PCU), which contains
N = 2^n processors, N memory modules, and an interconnection network.
The PCU processors are microprocessors that perform the SIMD and MIMD
computations. The PCU memory modules are used by the PCU processors for
data storage in SIMD mode and both data and instruction storage in MIMD
mode. The interconnection network provides communication among the PEs.
PASM will use either an Extra Stage Cube type or Augmented Data
Manipulator type of multistage network [7].

The  PCU  is  organized  as  shown  in  Fig.  3.  Each  processor  is  connected  to  a 
memory module to form a PE. A pair of memory units is used for each
Fig. 3 PASM Parallel Computation Unit.


memory module. This double-buffering scheme allows data to be moved
between one memory unit and secondary storage while the processor
operates on data in the other memory unit. Each memory unit is of
substantial size (e.g., 64K words). PEs are addressed (numbered) from 0
to N - 1.

The system control unit, a conventional computer, is responsible for the
overall coordination of the activities of the other components of PASM. The
memory management system controls the loading and unloading of the PE
memory modules from the multiple secondary storage devices of the memory
storage system. The microcontrollers (MCs) are a set of microprocessors that
act as the control units for the PEs in SIMD mode and orchestrate the
activities of the PEs in MIMD mode. Each of the Q MCs controls a fixed
group of N/Q PCU PEs. By combining the effects of multiple MCs, virtual
machines (partitions) can be created. Control storage contains the programs
for the MCs. PASM is being designed for N = 1024 and Q = 32. An
N = 16, Q = 4 prototype based on Motorola MC68000 processors is under
development [8].

This  brief  overview  of  PASM  provides  the  needed  background  for  this 
chapter.  Further  details  and  a  list  of  papers  about  PASM  can  be  found  in 
Siegel [9].

3. EXAMPLE TASK

Many  individual  image  and  speech  processing  algorithms  and  their  formula¬ 
tions  for  parallel  processing  environments  have  been  studied,  such  as  2-D 
FFTs [10], [11], Hadamard transforms [12], image correlation [13], histogramming
[4], resampling [14], one-dimensional FFTs [15], linear predictive
coding  [16],  and  dynamic  time  warping  [17].  However,  rarely  is  a  complete 
scenario  considered  as  a  whole.  Consider  the  situation  in  which  the  results  of 
one  algorithm  are  used  as  input  to  another.  In  the  parallel  environment,  this 
may  strongly  influence  how  each  algorithm  is  structured.  For  example, 
results  calculated  in  one  PE  might  need  to  be  communicated  to  another  PE 
for  use  in  a  later  algorithm. 

Contour extraction is a key tool for use in applications ranging from
computer assisted cartography to industrial inspection. Two algorithms from
a contour extraction task will be used as an application example for
demonstrating the architectural features that must be provided by PASM to
have an appropriate execution environment. It will be shown how
computational attributes of a parallel implementation of this example SIMD/MIMD
scenario influence the hardware design choices, including those features that
would be desirable in a custom-designed VLSI chip set.

The two algorithms to be considered are edge-guided thresholding (EGT)
and contour tracing. The EGT algorithm, discussed in Section 4, is used to
determine the optimal threshold for quantizing the image [18]. The contour
tracing algorithm, considered in Section 5, uses the set of optimal thresholds
to segment the image and trace the contours. These two parallel algorithms
are based on those developed in Tuomenoksa et al. [19] and are summarized
here because their processing demands are quite different from each other.
As will be seen, the EGT algorithm is best suited for SIMD mode, whereas
MIMD mode will be used for the contour tracing algorithm. Also, the EGT
algorithm will have inter-PE communication needs that are different from the
communication needs of the contour tracing algorithm. Other aspects are
discussed in Section 6. For this task scenario, the ability to partition PASM is
not used; i.e., all N PEs are employed.


4. EDGE-GUIDED THRESHOLDING

Consider an M x M pixel input image to be processed by the two algorithms.
The value of each pixel is assumed to be an 8-bit unsigned integer
representing one of 256 possible gray levels. Using the PASM model, assume
that the PEs are logically configured as a √N x √N grid, on which
the M x M image is superimposed; i.e., each processor has an
M/√N x M/√N subimage. For M = 4096, N = 1024, each PE stores a
128 x 128 subimage. Each input image pixel is uniquely addressed by
its i-x-y coordinates, where x and y are the x-y coordinates of the
pixel in the subimage contained in PE i.
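
The i-x-y addressing can be illustrated with a short Python sketch. The text does not say how PEs are numbered within the grid, so the row-by-row numbering used here is an assumption, as is the function name `to_ixy`.

```python
import math


def to_ixy(X, Y, M, N):
    """Map global pixel (X, Y) of an M x M image to (i, x, y).

    Assumes PEs are numbered row by row across the sqrt(N) x sqrt(N) grid;
    this numbering is illustrative and is not taken from the text.
    """
    grid = math.isqrt(N)      # PEs along one side of the grid
    side = M // grid          # each subimage is side x side pixels
    i = (Y // side) * grid + (X // side)
    return i, X % side, Y % side


print(to_ixy(130, 5, 4096, 1024))
```

For M = 4096 and N = 1024 the subimage side is 128, so global pixel (130, 5) falls in the second PE of the top grid row, at local coordinates (2, 5).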

The EGT algorithm consists of three major steps. First, the Sobel edge
operator  [20]  is  used  to  generate  an  edge  image  in  which  gray  levels  indicate 
the  magnitude  of  the  gradient.  A  figure  of  merit  that  indicates  how  well  a 
given  thresholded  gray-level  image  matches  edges  in  the  edge  image  is  then 
computed  for  every  possible  threshold.  Finally,  the  maximum  value  of  the 
figure-of-merit  function  is  chosen  to  determine  the  threshold  level.  This  is 
done  for  each  subimage  independently;  thus,  the  threshold  levels  may  differ 
from one subimage to the next. The complete EGT algorithm is most easily
formulated as the SIMD procedure given in Fig. 4. Let the subimage
SI be M/√N x M/√N and SI(i,x,y) be a subimage pixel, where
0 ≤ x,y < M/√N, 0 ≤ i < N. The algorithm is performed for all of the
subimages (all i) simultaneously.

Referring to Fig. 4, the first for statement clears the sumedge and nedge
counters (to be described) for each possible threshold value. The next pair of
nested for statements contains statements to calculate quantities associated
with each pixel in the subimage. The Sobel operators, sx and sy, represent
weighted pixel value differences in the x and y directions, respectively. The
value g(i,x,y) represents the gradient at pixel (i,x,y), and these values form
the edge image. The presence of an edge is indicated by high edge image pixel
values. Next, the local maximum and minimum pixel values over a 3 x 3
window are determined for each gray-level image pixel. Note that the same
image pixels necessary for the calculation of the gradient can be re-used for
the determination of the local maximum and minimum. The center pixel
of the 3 x 3 window is an edge point if the threshold is greater than or equal to
the local minimum and less than the local maximum. Running sums of the
edge image pixels (gradient values) corresponding to edge points at each
threshold (sumedge) and a count of the number of edge pixels for each
threshold (nedge) are updated in the innermost for loop. In general, each PE
performs this for statement using a different localmin and localmax and thus
performs the statements in the loop (updates the sums) various numbers of
times. This implies that each PE has the capability of maintaining its own
loop index values. PEs are disabled when they finish their looping, because
PEs must remain synchronized in SIMD mode. The total time to perform the
innermost for loop is the maximum time taken by any PE.
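
The effect of disabling PEs in the innermost loop can be sketched in Python as follows. This is a toy lockstep simulation under assumed inputs (one localmin and localmax value per PE for the current pixel); it is not PASM code, and the function name is hypothetical.

```python
def simd_inner_loop(localmin, localmax):
    """Simulate the innermost loop for all PEs in lockstep.

    localmin, localmax: hypothetical per-PE values for the current pixel.
    Returns (steps, updates): the number of lockstep iterations and the
    number of sum updates each PE actually performed.
    """
    n = len(localmin)
    thresh = list(localmin)            # each PE keeps its own loop index
    updates = [0] * n
    steps = 0
    while True:
        enabled = [thresh[pe] <= localmax[pe] - 1 for pe in range(n)]
        if not any(enabled):
            break
        for pe in range(n):
            if enabled[pe]:            # disabled PEs sit idle this step
                updates[pe] += 1
                thresh[pe] += 1
        steps += 1
    return steps, updates


print(simd_inner_loop([10, 50, 0], [12, 50, 200]))
```

Here the third PE needs 200 iterations, so all three PEs are held for 200 lockstep steps even though the first two do almost no useful work.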

The mean for each threshold (sumedge/nedge) is known as the figure of
merit (merit) and is calculated in the final for statement using the accumulated
sums. High figure of merit values indicate better matches between
threshold-generated boundaries and the edges detected by the Sobel operator.
The gray level associated with the maximum value of the figure-of-merit
function is chosen for image segmentation.

for thresh = 0 to 255 do
    sumedge(i, thresh) = nedge(i, thresh) = 0
for x = 0 to M/√N - 1 do begin
    for y = 0 to M/√N - 1 do begin
        sx(i,x,y) = 1/4 [(SI(i,x-1,y-1) + 2*SI(i,x-1,y) + SI(i,x-1,y+1))
                       - (SI(i,x+1,y-1) + 2*SI(i,x+1,y) + SI(i,x+1,y+1))]
        sy(i,x,y) = 1/4 [(SI(i,x-1,y-1) + 2*SI(i,x,y-1) + SI(i,x+1,y-1))
                       - (SI(i,x-1,y+1) + 2*SI(i,x,y+1) + SI(i,x+1,y+1))]
        g(i,x,y) = sqrt(sx(i,x,y)^2 + sy(i,x,y)^2)
        localmax(i)
            = max(SI(i,x-1,y-1), SI(i,x,y-1), SI(i,x+1,y-1),
                  SI(i,x-1,y),   SI(i,x,y),   SI(i,x+1,y),
                  SI(i,x-1,y+1), SI(i,x,y+1), SI(i,x+1,y+1))
        localmin(i)
            = min(SI(i,x-1,y-1), SI(i,x,y-1), SI(i,x+1,y-1),
                  SI(i,x-1,y),   SI(i,x,y),   SI(i,x+1,y),
                  SI(i,x-1,y+1), SI(i,x,y+1), SI(i,x+1,y+1))
        for thresh = localmin(i) to localmax(i) - 1 do begin
            sumedge(i, thresh) = sumedge(i, thresh) + g(i,x,y)
            nedge(i, thresh) = nedge(i, thresh) + 1
        end
    end
end
for thresh = 0 to 255 do
    merit(i, thresh) = sumedge(i, thresh)/nedge(i, thresh)

Fig. 4 EGT algorithm.
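
For readers who want to experiment, the EGT procedure for a single subimage can be sketched in serial Python. Several details are assumptions rather than statements from the text: a Sobel normalization factor of 1/4, processing of interior pixels only (in the parallel algorithm the one-pixel border would come from neighboring PEs), a figure of merit of zero when no edge points exist for a threshold, and choosing the lowest gray level when several tie for the maximum.

```python
import math


def egt_threshold(si):
    """Edge-guided threshold for one subimage si[y][x] of 8-bit gray levels.

    Only interior pixels are processed; border pixels would need data
    from neighboring subimages.
    """
    h, w = len(si), len(si[0])
    sumedge = [0.0] * 256
    nedge = [0] * 256
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # win[r][c] holds SI(x + c - 1, y + r - 1)
            win = [[si[y + dy][x + dx] for dx in (-1, 0, 1)]
                   for dy in (-1, 0, 1)]
            sx = ((win[0][0] + 2 * win[1][0] + win[2][0])
                  - (win[0][2] + 2 * win[1][2] + win[2][2])) / 4
            sy = ((win[0][0] + 2 * win[0][1] + win[0][2])
                  - (win[2][0] + 2 * win[2][1] + win[2][2])) / 4
            g = math.hypot(sx, sy)              # gradient magnitude
            flat = [v for row in win for v in row]
            # thresholds from localmin through localmax - 1
            for thresh in range(min(flat), max(flat)):
                sumedge[thresh] += g
                nedge[thresh] += 1
    merit = [s / n if n else 0.0 for s, n in zip(sumedge, nedge)]
    return max(range(256), key=merit.__getitem__)


# A dark region (10) next to a bright region (200).
step_image = [[10, 10, 10, 200, 200, 200] for _ in range(4)]
print(egt_threshold(step_image))
```

For this ideal step edge every gray level from 10 through 199 has the same figure of merit, so the returned value depends on the tie-breaking assumption.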

The  EGT  algorithm  is  particularly  well  suited  for  SIMD  parallelism 
because  all  pixels  are  processed  similarly.  This  aids  the  PE-to-PE  com¬ 
munication  necessary  when  a  PE  must  process  pixels  not  in  its  subimage  (i.e., 
in  a  neighbor  PE).  All  PEs  will  simultaneously  request  the  same  pixel  relative 
to their subimages. For example, when processing begins (with the upper left
corner subimage pixel) all PEs will request (from the PE to their upper left)
the pixel immediately above and to the left of their upper left corner pixel (if
this pixel is in the complete image). This transfer of data from upper left
neighbors can occur for all PEs simultaneously. In the case of this algorithm,
transmission delays incurred due to PE-to-PE data transfers can be overlapped
with data processing to reduce total execution time. A total of
4(M/√N + 1) parallel transfers are needed for an M x M pixel image. The
candidate interconnection networks of PASM can support these parallel
transfers from any neighboring PE.
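
The transfer count can be checked with a few lines of Python, reading it as four sides of M/√N border pixels plus four corner pixels, i.e. 4(M/√N + 1). The function name is hypothetical.

```python
import math


def border_transfers(M, N):
    """Parallel transfers needed to fetch a one-pixel-deep border:
    four sides of M/sqrt(N) pixels each, plus four corner pixels."""
    side = M // math.isqrt(N)
    return 4 * side + 4


print(border_transfers(4096, 1024))   # 4 * (128 + 1)
```

For the M = 4096, N = 1024 case each PE receives 516 border pixels, compared with the 16,384 pixels of its own 128 x 128 subimage.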

Since  PE-to-PE  communications  in  MIMD  mode  require  explicit  synchro¬ 
nization  between  the  two  processors  for  each  data  transfer,  SIMD  mode 
transfers should be used to provide each PE more efficiently with the
one-pixel-deep  border  points  of  its  subimage  (from  its  neighbors).  However, 
once  each  PE  has  all  of  the  data  it  needs  to  perform  the  EGT  algorithm,  the 
calculations  could  proceed  in  MIMD  mode.  Although  MIMD  mode  would 
make the execution of the innermost for loop more efficient (because no PEs
would be disabled), this advantage must be weighed against the extra time
involved  in  switching  from  SIMD  to  MIMD  mode  and  requiring  that  each 
PE  perform  its  own  control  flow  operations  for  the  outer  two  for  loops. 
Control  flow  operations  include  initialization  and  incrementing  of  loop 
counters,  evaluation  of  conditional  expressions,  and  branching.  These  opera¬ 
tions  are  performed  by  the  MC  in  SIMD  mode  for  the  outer  two  loops  and 
can  be  overlapped  with  the  PE  operations.  The  next  step  of  the  scenario  is 
contour tracing.

5.  CONTOUR  TRACING 

A  contour  tracing  algorithm  using  MIMD  parallelism  and  based  on  the  one 
given in Tuomenoksa et al. [19] is summarized in this section. Initially, each
PE  contains  a  threshold  value  T  for  its  subimage,  which  was  calculated  using 
the  EGT  algorithm  of  the  previous  section.  The  contour  tracing  algorithm 
has  two  phases.  In  Phase  I,  the  PEs  segment  their  subimages  based  on  the 
threshold  and  all  local  contours  (both  closed  and  partial)  are  traced  and 
recorded.  In  Phase  II,  the  partial  contours  traced  during  Phase  I  are 
connected. 

A contour table is constructed in each PE, containing an entry for every
partial  or  complete  contour  in  its  subimage.  Each  contour  table  entry 
contains  bookkeeping  information  such  as  the  threshold  value  that  generated 
the contour and a pointer to the i-x-y sequence of the contour. Each PE
also contains a partial contour list, which has an entry for each partial contour
containing  the  i-x-y  coordinates  of  its  two  end  points  and  a  pointer  to  its 
contour  table  entry. 

In Phase I there is no PE-to-PE communication. Each PE uses its
threshold  level  to  segment  its  subimage.  To  create  the  segmented  image  for 
threshold  T,  subimage  pixels  that  have  a  value  greater  than  or  equal  to  T  are 
assigned  a  value  of  one;  otherwise,  the  pixels  are  assigned  a  value  of  zero. 
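
The segmentation rule is simple enough to state directly in Python. The list-of-rows image representation and the function name are assumptions made for this sketch.

```python
def segment(subimage, T):
    """Return a binary image: 1 where the pixel value is >= T, else 0."""
    return [[1 if pixel >= T else 0 for pixel in row] for row in subimage]


print(segment([[5, 10], [200, 9]], 10))
```

Note that a pixel exactly equal to T is assigned one, matching the "greater than or equal to" wording above.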

Contour  tracing  begins  by  scanning  rows  of  the  segmented  image  begin¬ 
ning  with  the  top  row.  Scanning  stops  when  a  pixel  with  a  value  of  one  is 
found  that  has  a  zero-valued  neighbor  on  both  sides.  This  pixel  is  marked  as 
the  start  point  of  a  new  contour,  and  its  i-x-y  coordinates  are  stored.  For  edge 
PEs, i.e., those on the edge of the √N x √N grid of PEs, no image points lie
beyond  the  edge;  thus,  all  points  in  the  leftmost  (or  rightmost)  column  of  the 
subimage  of  the  PEs  in  the  leftmost  (or  rightmost)  column  of  the  grid  of  PEs 
are  potential  start  points.  For  all  other  left  and  right  subimage  edges,  it  is 
assumed  that  the  pixel  in  the  neighboring  PE  is  one-valued  so  that  spurious 
start points are not chosen. Bypassing a potential start point (e.g., a left
subimage  edge  with  a  zero-valued  neighbor  in  the  PE  to  its  left)  is  not  a 
problem  because  (1)  contours  have  multiple  potential  start  points  within  the 
subimage  and  (2)  the  partial  contours  will  be  connected  in  Phase  II  regardless 
of  the  start  point  chosen. 

The  contour  is  first  traced  in  a  counterclockwise  direction  (CCW)  if  the 
start  point  has  a  one-valued  point  to  its  right  and  is  first  traced  in  a  clockwise 
direction  (CW)  if  the  start  point  has  a  one-valued  point  to  its  left.  If  there  are 
zeroes  on  both  sides,  the  initial  direction  chosen  does  not  matter.  Consider 
the  start  point  pixel  as  the  center  pixel  of  the  3  x  3  window  in  which 
direction  0  is  east,  1  is  northeast,  and  so  on  [21].  The  CCW  algorithm  is 
stated as follows. Beginning with the neighboring pixel in direction five and
incrementing  by  1  modulo  8  to  determine  the  next  pixel,  look  for  a  pixel  that 
has  a  value  of  one.  When  it  is  found,  store  the  direction  p  of  this  new  pixel 
and  append  its  i-x-y  coordinate  to  the  contour  sequence.  Treat  this  pixel  as  a 
new  center  point  of  the  3x3  window.  Then  continue  by  looking  for  the  next 
pixel in the contour beginning with the pixel in position (p + 5) modulo 8.
Tracing  continues  until  the  start  point  or  a  subimage  boundary  (point  of 
indecision)  is  reached.  The  CW  algorithm  is  similar,  but  scanning  begins 
with  the  pixel  in  position  zero  and  decrements  by  1  modulo  8  to  determine 
the next pixel. After a point is found, the pixel in position (p + 3) modulo 8
is  scanned.  Horizontal  edges  that  span  a  subimage  are  also  recognized; 
however, they are treated as a special case because no start point would have
been identified. An implicit assumption is that all contours to be traced define
regions  that  have  area.  Examples  of  illegal  contours  that  would  not  be  traced 
are  one-pixel-wide  lines  or  isolated  points. 
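
The CCW scan described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the PASM implementation: it works on a single binary image with no subimage boundaries (so no points of indecision arise), treats out-of-range pixels as zero-valued, uses (row, column) coordinates with rows increasing downward, and stops as soon as the start point is reached again. The names OFFSETS and trace_ccw are hypothetical.

```python
# Direction 0 is east, 1 is northeast, and so on counterclockwise.
# Offsets are (row, column) with rows increasing downward (an assumption).
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
           (0, -1), (1, -1), (1, 0), (1, 1)]


def trace_ccw(image, start, max_steps=10000):
    """Trace a contour counterclockwise from `start` in a binary image."""
    rows, cols = len(image), len(image[0])
    contour = [start]
    r, c = start
    first_dir = 5  # the first scan begins in direction five
    for _ in range(max_steps):
        found = None
        for k in range(8):
            d = (first_dir + k) % 8  # increment by 1 modulo 8
            nr, nc = r + OFFSETS[d][0], c + OFFSETS[d][1]
            if 0 <= nr < rows and 0 <= nc < cols and image[nr][nc] == 1:
                found = (nr, nc, d)
                break
        if found is None:
            break  # isolated point: not a legal contour
        nr, nc, p = found
        if (nr, nc) == start:
            break  # contour closed
        contour.append((nr, nc))
        r, c = nr, nc
        first_dir = (p + 5) % 8  # next scan starts at (p + 5) modulo 8
    return contour


image = [[0, 0, 0, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0]]
contour = trace_ccw(image, (1, 1))
```

For the 3 x 3 block above, the trace runs down the left edge, along the bottom, up the right edge, and back along the top, which is counterclockwise on the displayed image.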

A  point  of  indecision  is  reached  when  a  pixel  from  an  adjacent  subimage  is 
needed  to  determine  the  next  direction  of  the  contour  [19].  When  a  point  of 
indecision is reached, it is recorded as an end point, and the algorithm returns
to the start point to trace the contour in the opposite direction until another
point of indecision is reached. When tracing in the CW (or CCW) direction,
the new contour pixels are inserted onto the front (or back) of the i-x-y
sequence.  Pixels  in  the  thresholded  image  are  marked  so  that  the  contour  will 
not  be  retraced. 

As  an  example,  a  30  x  20  image  is  divided  into  six  10  x  10  subimages; 
each  subimage  is  loaded  into  one  of  six  PEs.  The  result  of  Phase  I  processing 
is shown in Fig. 5 where a dot indicates a one-valued pixel. Even though the
entire object in PE 5 was located within the subimage, the left edge of the
object  was  not  traced  in  Phase  I,  because  PE  5  could  not  determine  whether 
the  object  continued  into  the  next  subimage.  On  the  other  hand,  a  closed 
contour  was  found  in  Phase  I  for  the  object  in  PE  4,  because  the  object  did 
not  include  any  border  pixels  of  the  subimage. 

Fig. 5 Results of Phase I of contour tracing for a 30 x 20 image. (Based on
Tuomenoksa et al. [19].)

For the example in Fig. 5, the local threshold value T is applied to the
subimage in each PE. Each PE i begins scanning its respective subimage at
pixel (i, 0, 0) for a one (indicated by a dot) with a zero on either side.
Depending  on  the  start  point  found,  tracing  will  proceed  in  either  the  CW  or 
CCW  direction.  For  example,  contours  A,  C,  E,  D,  and  G  are  traced  in  the 
CCW  direction  first,  whereas  contours  B,  F,  and  H  are  traced  in  the  CW 
direction  first.  In  the  example,  PEs  1  and  3  have  found  two  start  points  and 
have  produced  two  traces.  Once  a  PE  has  scanned  the  segmented  image 
generated by its threshold, Phase I is complete.

In  Phase  II,  each  PE  attempts  to  connect  its  partial  contours  to  those 
located  in  neighboring  PEs.  In  order  for  a  PE  to  extend  a  contour,  it  must  be 
able  to  access  and  modify  contour  tables  that  are  located  in  other  PEs.  As  a 
result,  a  mechanism  to  allow  access  to  a  contour  table  entry  by  only  one  PE  at 
a  time  must  be  provided  by  the  system  and  used  by  the  contour  tracing 
algorithm.  A  semaphore  [22]  associated  with  each  contour  table  entry  is  used 
to  indicate  whether  or  not  that  entry  is  locked  so  that  no  other  processor  can 
access  it.  Semaphores  are  used  to  prevent  variable  access  and  updating 
problems  due  to  interrupts.  Details  of  these  problems  are  beyond  the  scope  of 
this  paper. 

For  the  example  of  Fig.  5,  PE  0  might  try  to  extend  the  CW  end  point  of 
partial  contour  A  by  considering  the  possible  extending  pixels  in  PE  1  one  at 
a time using the CW algorithm. To do this, PE 0 first locks the contour table
entry  for  A.  Then  PE  0  requests  that  PE  1  check  its  partial  contour  lists  to 
determine  if  any  partial  contour  has  the  possible  extending  point  as  an  end 
point.  If  such  a  partial  contour  exists,  PE  1  locks  the  contour  table  entry 
pointed  to  by  the  partial  contour  list  signifying  that  this  entry  is  to  be  linked. 
In  this  case,  PE  0  determines  that  A  can  be  linked  to  B;  thus,  PE  1  locks  B’s 
contour table entry so that only PE 0 will be allowed to connect the partial
contour. The i-x-y sequence for contour B is transferred to PE 0 and
concatenated to the i-x-y sequence of partial contour A, forming a new,
extended partial contour AB. If PE 0 had found the contour table entry of
partial contour B to be already locked, it would not have been allowed to
connect the contour. The extension of corner points is handled similarly but
involves communication with more than one PE. Note that the use of
semaphores prevents another PE, e.g., PE 3, from using PE 1 to access B's
contour table entry, which PE 1 is in the process of modifying for PE 0.

Once  PE  i  locates  a  partial  contour  in  an  adjacent  subimage  that  continues 
the  contour  and  has  stored  the  concatenated  contour  in  its  contour  table,  it 
repeats the process, if necessary, by following the contour to the next PE
until  the  contour  is  closed  or  cannot  be  extended. 

Fig. 6 Results of Phase II of contour tracing for a 30 x 20 image. (Based on
Tuomenoksa et al. [19].) The legend distinguishes pixels traced in Phase I,
pixels traced in Phase II, and the first pixel in the i-x-y sequence of each
contour.

Independently of the actions of PE 0, PE 3 might attempt to extend
contour D CCW to form the partial contour DC. If PE 3 attempted to extend
the result, DC, when PE 0 is in the process of extending A into PE 1, it will
find A locked. PE 3 then abandons its attempt to close the contour, because
PE 0 is also attempting to do it, and unlocks partial contour DC. This allows
PE 0 to access DC after it has appended B to A. Therefore, the closed contour
ABDC is ultimately traced by PE 0. Alternatively, if PE 0 had completed
linking B to A before PE 3 completed linking C to D, and PE 0 had found D
locked, it would have unlocked AB. Thus, the closed contour would have been
completely traced by PE 3. Not allowing a PE to wait for access to another
PE's locked contour table entry and requiring the blocked PE to unlock its
affected partial contour prevents deadlock.
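
The back-off rule just described (never wait on a locked entry, and release one's own entry when blocked) can be sketched as follows. This is an illustration only: Python's threading.Lock stands in for the per-entry semaphore, both entries live in one process rather than in separate PEs, and ContourEntry and try_link are hypothetical names.

```python
import threading


class ContourEntry:
    """A contour table entry guarded by its own lock (semaphore stand-in)."""

    def __init__(self, name, points):
        self.name = name
        self.points = list(points)  # the i-x-y sequence
        self.lock = threading.Lock()


def try_link(own, remote):
    """Append `remote` to `own`; give up instead of waiting on any lock."""
    if not own.lock.acquire(blocking=False):
        return False
    try:
        if not remote.lock.acquire(blocking=False):
            return False  # blocked: back off, own entry is released below
        try:
            own.points.extend(remote.points)
            own.name += remote.name
            remote.points.clear()
            return True
        finally:
            remote.lock.release()
    finally:
        own.lock.release()


a = ContourEntry("A", [(0, 9, 3)])
b = ContourEntry("B", [(1, 0, 3), (1, 0, 4)])

b.lock.acquire()               # some other PE is already working on B
blocked = try_link(a, b)       # fails, and A is left unlocked
b.lock.release()
linked = try_link(a, b)        # succeeds once B is free
```

Because a blocked caller never holds one lock while waiting for another, two callers cannot hold each other's entries indefinitely, which is the deadlock-avoidance argument made in the text.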

Occasionally,  some  contour  tracing  operations  must  be  performed  in  Phase 
II  before  certain  contours  can  be  linked.  Figure  6  shows  a  situation  in  which 
PE  2  traces  contour  E  along  the  subimage  boundary  in  Phase  II  before 
linking  it  to  contour  F.  The  subimage  boundary  pixels  of  contour  H  are  also 
traced  in  Phase  II. 

These examples demonstrate the basic ideas underlying the algorithms.
The actual parallel algorithm details that ensure proper interaction of the PEs
are  complex  and  are  not  examined  here. 

When  Phase  II  of  the  algorithm  is  complete,  the  i-x-y  sequence  for  each 
contour  in  the  image  will  be  contained  in  exactly  one  of  the  PEs  that 
contained  part  of  the  contour  originally.  The  result  of  Phase  II  processing  for 
the example of Fig. 5 is shown in Fig. 6. Since each PE tries to connect its
contours independently, the number of the PE that finally closes a given
contour is nondeterministic. Although this may not be desirable in a few
cases,  in  general  the  lack  of  a  specific  protocol  determining  which  PEs  can 
close  contours  equalizes  both  the  processing  load  of  each  PE  and  the  number 
of  closed  contours  that  eventually  reside  in  each  PE. 


6.  ARCHITECTURAL  IMPLICATIONS

The  study  of  a  parallel  image  processing  task  leads  to  an  understanding  of 
necessary  and  useful  hardware  features  for  a  system  such  as  PASM.  For  the 
example  algorithms,  aspects  of  each  that  have  an  architectural  impact  will 
be  listed.  Processor-specific  considerations  (e.g.,  instruction  set)  are  also 
treated,  because  they  can  have  a  profound  effect  on  the  performance  of  the 
algorithms. 

Although  only  two  closely  related  algorithms  were  presented  in  the 
previous  sections,  the  two  could  hardly  have  been  more  different  in  their 
processing  demands.  As  discussed  in  Section  4,  the  EGT  algorithm  is  best 
suited for SIMD mode. This is because the algorithm requires data that are
mostly  local  to  each  PE.  Also,  there  are  approximately  (or  exactly)  the  same 
number  of  pixels  to  be  processed  in  each  PE,  and  all  pixels  are  processed 
similarly. 

When nonlocal data are needed in the EGT algorithm, the eight
nearest-neighbor PEs comprise the set of data sources. The PE-to-PE transfer of
information must be efficient, or the parallel algorithms will be slowed. In its
simplest form, this communication would be handled entirely by the PEs;
each PE would control the network settings (through the use of routing tags
[7]) and perform all of the network protocol support (e.g., buffering, error
detection). Each word transferred and each new network setting would
require processor instructions. A more efficient method of PE-to-PE
communication is by direct memory access (DMA). DMA is a means by which data
can be retrieved from one memory location and stored in another without
processor intervention. The DMA hardware usually operates on a
cycle-stealing basis so that a PE's access to its memory is not severely affected. In
its basic form, PEs in SIMD mode would enter a DMA handling routine.
This routine computes the local memory address range of the points to be
transferred and sends this information to the special DMA hardware. The
PEs then would compute the destination address of the PE that is to receive
the data and set the network accordingly. The DMA hardware then would
autonomously retrieve the information from local memory and perform the
necessary network interfacing to send the data to the requesting PE.
However, the PE would still be responsible for checking the incoming data
(after the transfer is complete) for transmission errors and so forth. A more
advanced implementation of DMA capability is the use of an intelligent
network interface unit (NIU). Requests for data from remote PEs would be
made to the local NIU, which would interpret and satisfy the request by
coordinating with a remote NIU. The NIU would combine DMA capability
with network protocol support. VLSI technology may allow ready fabrication
of sophisticated NIUs.

As discussed in Section 4, M/√N pixels come from each of four neighbors.
For the sake of example, let M = 4096, N = 1024, and M/√N = 128.
Rather  than  involving  the  source  and  destination  PEs  in  the  individual 
transfers  of  these  points,  one  of  the  DMA  modes  just  described  would  be  of 
great  use.  If  pixels  were  stored  in  PEs  by  row  (rows  numbered  0-127),  and 
the transfer from PE i to PE i + √N was selected, the DMA hardware of PE
i would be instructed to transfer 128 pixels starting at the address of row 127
of the image. The DMA hardware associated with PE i + √N would be set to
read 128 pixels from the network and store them beginning at an address
representing row -1. When data are transferred from PE i to PE i + 1,
the situation is more complicated in that image data to be transferred are not
contiguous. Conventional DMA hardware only supports physical block
transfers of data. Here, a strong case for an intelligent NIU is made: the NIU
could  accept  more  complicated  instructions  such  as  “transfer  128  pixels 
starting  at  address  X,  taking  every  128th  pixel.” 
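
The difference between the two transfers can be made concrete with a short sketch. It assumes a 128 x 128 subimage stored row by row in a flat list, with list indices standing in for memory addresses; block_transfer is a hypothetical helper that models a DMA or NIU request, not an actual PASM interface.

```python
import math

M = 4096                      # image is M x M pixels
N = 1024                      # number of PEs
side = M // math.isqrt(N)     # subimage is side x side; 4096 / 32 = 128


def block_transfer(memory, start, count, stride=1):
    """Read `count` words beginning at `start`, taking every `stride`-th word."""
    return [memory[start + k * stride] for k in range(count)]


# Row-major subimage: the pixel at (row, col) lives at address row * side + col.
subimage = list(range(side * side))

# Bottom row, sent to PE i + sqrt(N): one contiguous block (plain DMA suffices).
bottom_row = block_transfer(subimage, start=127 * side, count=side)

# Rightmost column, sent to PE i + 1: every 128th word (needs strided access).
right_column = block_transfer(subimage, start=side - 1, count=side, stride=side)
```

With stride 1 the request is an ordinary block transfer; the column transfer is the same request with a stride of 128, which is the kind of instruction the text suggests an intelligent NIU could accept.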

The  processing  requirements  (instruction  set)  for  the  EGT  algorithm  are 
not  out  of  the  ordinary.  LSI  technology  already  allows  the  fabrication  of 
complete microprocessors having all required arithmetic and data manipulation
operations on a single chip. Recent designs (e.g., Motorola 68000 [23])
handle  a  variety  of  data  formats  including  bit,  byte,  16-bit  word,  and  32-bit 
long  word  types.  Floating  point  and  special  arithmetic  function  (e.g.,  square 
root,  trigonometric)  capability  abounds  in  the  form  of  coprocessor  chips. 
Although  the  EGT  algorithm  involved  only  one  special  function  (square  root) 
in  the  calculation  of  the  gradient,  other  algorithms  such  as  image  rotation, 
parallel root finding, and FFTs for speech processing make heavy use of
special functions. Since many of the special arithmetic functions are
calculated by iterative procedures, a strong case is made for including hardware to
perform these operations rather than performing them in software. Software
procedures in which the number of iterations required is data-dependent are
especially troublesome in SIMD mode, because processors must be disabled
as they complete the desired number of iterations. Also, the total time to
perform  an  operation  is  the  maximum  time  required  by  any  processor 
(because  the  PEs  are  synchronized).  There  is  a  slight  advantage  to  having 
special-purpose  arithmetic  functions  on  the  same  standard  CPU  chip  in  that 
data  to  be  processed  need  not  be  moved  between  the  two  devices.  VLSI 
technology should make such combined CPU-specialized arithmetic processor
chips a reality.
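
The cost of data-dependent iteration counts in SIMD mode can be illustrated with a small simulation. It is a sketch under assumptions: Newton's method is used as a stand-in for an iterative software square root, each list element plays the role of one PE, and simd_sqrt is a hypothetical name. The outer loop models lock-step execution, so the step count returned is the maximum needed by any PE.

```python
import math


def simd_sqrt(values, tol=1e-9):
    """Lock-step Newton square root; PEs are disabled as they converge."""
    x = [max(v, 1.0) for v in values]     # initial guess in each PE
    active = [True] * len(values)         # enable mask
    steps = 0
    while any(active):
        steps += 1                        # one broadcast iteration
        for pe, v in enumerate(values):
            if not active[pe]:
                continue                  # disabled PEs sit idle
            nxt = 0.5 * (x[pe] + v / x[pe])
            if abs(nxt - x[pe]) <= tol:
                active[pe] = False        # this PE is done
            x[pe] = nxt
    return x, steps


values = [2.0, 100.0, 1.0e6]
roots, steps = simd_sqrt(values)
errors = [abs(r - math.sqrt(v)) for r, v in zip(roots, values)]
```

The PE holding 2.0 converges in a few iterations but remains idle until the PE holding 1.0e6 finishes, which is the inefficiency the paragraph above attributes to software procedures in SIMD mode.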

The processors must be capable of operating in SIMD mode efficiently.
Although  designs  for  using  off-the-shelf  microprocessors  as  SIMD/MIMD 
processing  elements  have  been  developed,  some  external  hardware  would  be 
required  to  enable,  disable,  and  synchronize  PEs  and  get  them  to  operate  in 
slave  mode,  i.e.,  to  accept  instructions  broadcast  by  a  control  unit  rather  than 
to  take  the  instructions  from  their  local  memories  [8].  This  external  hardware 
could  be  easily  incorporated  into  a  VLSI  chip. 

The EGT algorithm has been simulated for N ranging from 16 to 256 and a
total  image  size  of  64  x  64  pixels.  A  special-purpose  SIMD  simulator 
developed  to  evaluate  the  MC68000-based  PASM  design  described  in  Kuehn 
et  al.  was  used  to  perform  the  simulations.  Although  the  details  of  the 
simulation  results  are  not  presented  here,  the  general  trends  of  the  results  will 
be  described. 

As the number of PEs (N) decreased, the subimage size increased because
a  fixed-size  total  image  of  64  x  64  pixels  was  used.  For  large  subimages,  the 
ratio  of  subimage  edge  pixels  to  total  subimage  pixels  is  low,  making 
processing  very  efficient.  This  is  because  inter-PE  transfers  make  up  only  a 
very  small  fraction  of  the  total  processing  time.  A  speedup  factor  (serial 
execution time/parallel execution time) approaching N was obtained for
arithmetic operations for this case. (A speedup of N is optimal.) As N was
increased to 256 PEs, the subimage size decreased to 4 x 4. Here, the ratio of
subimage  edge  pixels  to  total  subimage  pixels  is  very  high,  and  inter-PE 
transfers  make  up  a  large  percentage  of  the  total  processing  time.  Although 
the  total  processing  time  is  minimized  as  N  increases,  the  speedup  factor 
decreases.  The  simulations  imply  that  N  should  be  as  large  as  possible  for  the 
EGT  algorithm  to  minimize  the  processing  time.  However,  this  would  make 
contour  tracing  (the  next  algorithm  of  the  scenario)  inefficient,  because  few 
contours would be traced in Phase I, and heavy use of inter-PE
communication would be needed to close the contours in Phase II. Thus, the scenario
must  be  considered  as  a  whole  rather  than  as  a  sequence  of  individual 
algorithms. 
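
The trend in the edge-to-total pixel ratio follows directly from the subimage size, as the short calculation below shows. It assumes square subimages on a √N x √N grid of PEs and counts the outermost ring of each subimage as its edge pixels; subimage_stats is a hypothetical helper, and the figures are plain arithmetic rather than simulation results.

```python
import math


def subimage_stats(image_side, n_pes):
    """Return (subimage side, fraction of subimage pixels on its edge)."""
    grid = math.isqrt(n_pes)
    if grid * grid != n_pes or image_side % grid != 0:
        raise ValueError("expects a square PE grid that divides the image")
    s = image_side // grid
    edge_pixels = s * s - max(s - 2, 0) ** 2   # outer ring of the subimage
    return s, edge_pixels / (s * s)


# 64 x 64 image, as in the simulations described above.
table = {n: subimage_stats(64, n) for n in (16, 64, 256)}
```

For N = 16 the subimage is 16 x 16 and about 23% of its pixels lie on the edge; for N = 256 the subimage is 4 x 4 and 75% do, which is consistent with inter-PE transfers dominating at large N.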

Turning now to the contour tracing algorithm of Section 5, we note that
both phases of the algorithm are suited to MIMD mode, because they involve
data-dependent execution times. Phase I of contour tracing requires only
local data, whereas Phase II makes heavy use of nonlocal data. Phase I
imposes no extraordinary requirements on the system, because there are no
special  arithmetic  operations  and  no  network  transfers  to  be  done.  Phase  II, 
however,  with  its  arbitrary  one-to-one  connections  (when  transferring  partial 
contour  information  between  nonadjacent  PEs),  use  of  semaphores,  and 
special  signaling  protocols  imposes  many  new  architectural  requirements. 

The  interconnection  network  and  any  DMA  or  NIU  hardware  would  be 
heavily  used  in  Phase  II  processing  when  PEs  extending  partial  contours 
probe  remote  PE  memories  that  may  contain  the  extensions  of  the  partial 
contours.  As  in  the  EGT  algorithm,  NIU  hardware  would  be  of  great  use, 
because  it  could  process  queries  about  possible  extensions  to  partial  contours 
without  interrupting  the  remote  PE.  There  would  be  a  combination  of  short 
and  long  messages  between  PEs  during  this  phase.  A  short  message  would 
occur  when  a  PE,  extending  a  partial  contour,  requests  information  about 
possible  extending  pixels  from  a  remote  PE.  If  a  connecting  partial  contour  is 
found,  a  long  message,  consisting  of  the  i-x-y  sequence  of  the  partial 
contour,  would  be  sent.  Thus  the  interconnection  network  should  support  a 
variety  of  message  sizes  so  that  the  efficiency  of  sending  either  type  of 
message  is  high. 

Since  semaphores  play  a  large  part  in  ensuring  correct  linking  of  partial 
contours in Phase II, processors must be equipped with test-and-set or similar
operations to facilitate a correct semaphore implementation. Most modern
microprocessors  already  have  some  semaphore  capabilities. 

If  the  system  is  to  support  the  execution  of  the  two  example  algorithms 
well, it must be capable of dynamically switching between SIMD and MIMD
operation, as PASM can. With only SIMD capability, the contour tracing
algorithm would be executed with huge inefficiencies, because there would
be varying numbers and lengths of contours and arbitrary one-to-one
communication patterns. A machine having only MIMD mode would be less
seriously affected but would lengthen execution time for the EGT algorithm,
due  to  the  need  for  explicit  synchronization  for  each  data  transfer  step  and 
the  overhead  of  loop  counter  processing  which  is  done  concurrently  by  the 
MCs  in  SIMD  mode.  Thus,  the  capability  to  dynamically  switch  between 
SIMD  and  MIMD  modes  is  important  so  that  each  algorithm  can  be  executed 
in  the  most  appropriate  mode  of  parallelism. 

Since PASM is an SIMD/MIMD system, the interconnection networks
proposed  for  PASM  would  be  capable  of  operating  both  synchronously  and 
asynchronously.  The  proposed  networks  are  of  the  multistage  type  and  can 
perform  both  the  nearest-neighbor  and  arbitrary  one-to-one  connections. 

The  design  of  a  multi-microprocessor  system  that  could  be  used  as  a 
building block for PASM is discussed in Kuehn et al. [8]. This design uses
the Motorola MC68000 as the heart of both the PE and MC components. The
extra hardware needed for SIMD/MIMD mode processing and communication
was described. It was found that most of the extra hardware was involved
in the enabling/disabling, synchronization, and instruction broadcasting for
SIMD mode and in getting the PEs to switch from SIMD to MIMD mode
and back again efficiently. The design highlights are as follows:

MC  CPU.  The  MC  CPU  is  a  Motorola  MC68000-series  processor. 

Fetch unit. This unit fetches instructions from MC memory in SIMD
mode,  determines  whether  they  are  control  (MC)  or  data  processing  (PE) 
instructions,  and  broadcasts  them  either  to  the  MC  CPU  or  PE  CPUs.  Each 
instruction  word  in  the  MC  memory  is  tagged  to  allow  the  fetch  unit  to 
determine  its  type.  The  tags  are  generated  at  assembly  time. 

Masking  operations  unit.  This  is  specialized  hardware,  under  the  control  of 
the  MC  CPU,  that  produces  a  mask  (pattern)  used  to  selectively  enable  or 
disable  PEs  (used  in  SIMD  mode). 

MC/PE interface. This is specialized hardware to queue PE instructions
and  enable  signals  broadcast  to  the  PEs.  The  queue  has  been  shown  to 
increase  the  amount  of  program  overlap  between  the  MC  and  PEs.  This 
interface  is  for  SIMD  mode;  there  would  also  be  a  MC/PE  communication 
bus  for  MIMD  mode  and  error-handling  messages  (which  is  not  discussed 
here). 

PE  CPU.  The  PE  CPU  is  a  Motorola  MC68000-series  processor. 

SIMD/MIMD mode switching logic. This is a specialized address decoder
that  generates  instruction  requests  to  the  MC/PE  interface  in  SIMD  mode 
and  causes  local  PE  memory  to  be  accessed  in  MIMD  mode. 

Network interface unit. This unit handles DMA and network protocol.
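
The interplay of the fetch unit, the masking operations unit, and the MC/PE interface queue can be sketched as a toy model. Everything here is an assumption made for illustration: the tag values "MC" and "PE", the instruction mnemonics, and the function names are hypothetical, and a single fixed mask is applied to every queued instruction, whereas the design described above queues enable signals along with the instructions.

```python
from collections import deque


def fetch_and_dispatch(program):
    """Split a tagged SIMD program into MC words and a queue of PE words."""
    mc_words = []
    pe_queue = deque()
    for tag, word in program:
        if tag == "MC":
            mc_words.append(word)      # control instruction, run by the MC CPU
        elif tag == "PE":
            pe_queue.append(word)      # data instruction, queued for broadcast
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    return mc_words, pe_queue


def broadcast(pe_queue, mask):
    """Deliver each queued word to the PEs whose mask bit is set."""
    received = [[] for _ in mask]
    while pe_queue:
        word = pe_queue.popleft()
        for pe, enabled in enumerate(mask):
            if enabled:
                received[pe].append(word)
    return received


program = [("MC", "loop_setup"), ("PE", "load_pixel"),
           ("PE", "add_gradient"), ("MC", "branch")]
mc_words, pe_queue = fetch_and_dispatch(program)
received = broadcast(pe_queue, mask=[True, False, True, True])
```

The queue decouples the two streams: the MC can continue with its own control instructions while queued PE instructions wait to be broadcast, which is the overlap the MC/PE interface is said to provide.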

VLSI  technology  should  be  used  to  combine  the  components  listed  above 
only  when  some  speed  or  complexity  advantage  is  gained.  For  example,  the 
PE  CPU  and  SIMD/MIMD  mode  switching  logic  should  be  combined  into  a 
single  component  so  that  the  PEs  can  operate  equally  well  in  SIMD  and 
MIMD  mode.  This  action  would  result  in  very  little  additional  silicon  area 
and  at  most  a  few  additional  pins  being  used.  Taking  this  one  step  further, 
one  could  also  fabricate  the  DMA  and  NIU  hardware  on  the  PE  CPU  chip. 
However,  to  allow  communication  on  the  CPU  data  bus  (with,  for  example, 
the  local  memory  chips)  and  the  NIU-interconnection  network  bus  to  occur 
simultaneously,  pins  for  a  complete  NIU  bus  interface  would  have  to  be 
added. The technology at implementation time would determine the maximum
pin count and thus the suitability of this scheme.

Similarly,  the  MC  CPU  and  fetch  unit  should  be  combined  on  one  chip  so 
that  MC  operations  such  as  fetching  SIMD  instructions  and  branching  are 
done  by  the  same  unit.  The  masking  operations  unit  could  easily  be  made  a 
part  of  the  MC  CPU  since  it  is  not  too  complex;  however,  the  number  of  CPU 
pins would have to increase by N/Q. For the PASM design goal of N = 1024
and  Q  =  32,  N/Q  =  32.  Again,  the  desirability  of  integrating  this  unit  is 
dependent  on  pin  count  limitations.  The  MC/PE  interface  is  also  a 
candidate  for  inclusion  on  the  MC  chip.  It  would  not  require  much  silicon 
area,  but  its  pin  requirements  are  high.  Since  the  interface  queues  both 
enable  signals  and  instruction  words  to  be  broadcast  to  the  PEs,  an  additional 
N/Q  +  16  bits  would  be  required  on  the  MC  CPU  package  (for  MC68000 
16-bit  words).  Thus,  assuming  that  the  number  of  pins  that  the  MC  CPU 
alone requires is P, if the masking operations unit is integrated with the CPU,
P  +  N/Q  pins  would  be  required;  if,  in  addition,  the  MC/PE  interface  is 
integrated,  P  +  N/Q  +  16  pins  would  be  required  (the  masking  operations 
unit  output  pins  to  the  MC/PE  interface  would  now  be  internal  to  the  chip). 
As has been discussed in McMillen and Siegel [24], VLSI implementation of
interconnection  network  functions  is  most  promising,  both  from  a  functional 
standpoint  and  a  design  standpoint  due  to  network  regularity. 
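
The pin-count expressions above can be evaluated for the stated design goal. In this sketch the base pin count P = 64 is only an illustrative value chosen for the example, and mc_pin_counts is a hypothetical helper; N, Q, and the 16-bit word width are taken from the text.

```python
def mc_pin_counts(p, n, q, word_bits=16):
    """Pins needed by the MC chip for the two integration options."""
    mask_pins = n // q                          # one enable line per PE of an MC
    with_mask_unit = p + mask_pins              # P + N/Q
    with_interface = p + mask_pins + word_bits  # P + N/Q + 16
    return with_mask_unit, with_interface


# N = 1024 PEs and Q = 32 MCs, so N/Q = 32 enable lines.
with_mask_unit, with_interface = mc_pin_counts(p=64, n=1024, q=32)
```

Under these assumptions, integrating the masking operations unit adds 32 pins, and integrating the MC/PE interface as well adds a further 16 for the broadcast instruction word.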

In  summary,  based  on  our  prototype  plans  and  the  expected  execution 
needs  of  the  contour  extraction  task  and  other  image  and  speech  processing 
algorithms, certain desirable system architecture features have been
identified. These include dynamically switchable SIMD/MIMD capability,
support  for  PE-to-PE  communications  using  DMA  and  intelligent  network 
interfaces,  and  special  arithmetic  function  hardware.  These  requirements  are 
consistent  with  the  capabilities  of  a  VLSI  implementation  of  PASM. 


7.  SUMMARY 

Contour  extraction  has  been  used  as  an  image  processing  scenario  to  explore 
the  advantages  and  implications  of  using  the  PASM  parallel  processing 
system.  Use  of  these  parallel  algorithms  leads  to  several  advantages,  notably 
speedup.  Analysis  of  the  algorithms  has  motivated  the  inclusion  of  several 
important  architectural  features.  These  features  were  used  to  discuss  possible 
configurations  of  a  custom-designed  VLSI  processor  chip  set  for  PASM.  The 
use  of  algorithm  characteristics  to  drive  the  design  of  PASM  leads  to  a 
machine  with  features  that  provide  the  necessary  flexibility  for  executing 
image  and  speech  processing  algorithms. 


ABSTRACT

Normalized  Fourier  descriptors  provide  an  effective  but  computationally 
intensive  method  for  performing  object  identification  and  tracking.  Parallel 
algorithms  to  compute  normalized  Fourier  descriptors  are  presented.  The  task 
includes sub-algorithms for conversion of chain code inputs to X-Y coordinates,
filtering, resampling, Fourier transform, and normalization. MIMD and SIMD
formulations are considered. The algorithms are analyzed with respect to
computational complexity and communications requirements. For typical problem
sizes and appropriate choice of machine size P, speedups of O(P) are achieved.


1.  INTRODUCTION 

Image processing algorithms are growing more complex as research is conducted.
Performance demands are also increasing steadily. The major factor fueling these
advances is increased speed of computer hardware. Practically, speed of computing
hardware has some limitations. In the future, gains in speed may not be as
dramatic. Even today, certain tasks cannot be performed because of computational
bottlenecks  and  real-time  requirements. 

Clearly,  a  solution  to  these  problems  lies  in  the  replication  of  available 
computing  hardware.  A  challenging  part  of  this  field  lies  in  the  development  of 
parallel  algorithms  for  varied  image  processing  tasks.  In  this  paper,  an 
implementation of an image processing algorithm is presented. It is representative of
a wide class of image processing algorithms since it is composed of several
sub-algorithms, each with different characteristics.

2.  ALGORITHM  OVERVIEW 

Given  the  contour  of  an  object  in  a  two  dimensional  plane  as  input,  a  series  of 
frequency domain coefficients which describes the image is computed. These are the
Fourier  descriptors,  which  are  further  processed  in  a  normalization  procedure  so  that 
they can be compared to a library of these descriptors. Output of the algorithm is
the  identification  of  the  object  as  well  as  a  reasonable  estimate  of  its  orientation  in 
space  |0|.  The  algorithm  has  been  proven  effective  in  identifying  and  tracking 
aircraft in flight [10].

Input to the program consists of the chain code representation of the contour of
an  image  in  a  two  dimensional  plane.  In  practice,  chain  code  inputs  typically 
contain  from  64  to  2048  points.  This  chain  code  input  is  then  converted  to  X-Y 
coordinates  of  the  image.  After  optional  smoothing,  the  image  is  resampled  at 
equally  spaced  intervals  on  the  contour.  Then  a  complex  Fourier  transform  is 
performed  on  the  resampled  points.  This  produces  a  Fourier  descriptor  (FD). 
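
The first step, conversion of a chain code to X-Y coordinates, amounts to a running sum of unit steps. The sketch below is a serial illustration only; it assumes the common 8-direction convention (0 is east, codes increase counterclockwise, y increases upward), which the text does not specify, and STEPS and chain_to_xy are hypothetical names.

```python
# Assumed 8-direction chain code: 0 = east, 1 = northeast, ..., 7 = southeast.
STEPS = {0: (1, 0), 1: (1, 1), 2: (0, 1), 3: (-1, 1),
         4: (-1, 0), 5: (-1, -1), 6: (0, -1), 7: (1, -1)}


def chain_to_xy(chain, start=(0, 0)):
    """Convert a chain code into the list of X-Y points it visits."""
    x, y = start
    points = [(x, y)]
    for code in chain:
        dx, dy = STEPS[code]
        x, y = x + dx, y + dy          # running sum of the unit steps
        points.append((x, y))
    return points


square = chain_to_xy([0, 2, 4, 6])     # unit square, returning to the start
```

A closed contour returns to its start point, so the last point equals the first, as in the unit square above.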

The second logical division of the algorithm normalizes this descriptor. The
goal is to scale and orient the contour by rule such that the FD from an unknown
contour will always normalize to the correct library representation. Different
normalizations have been proposed; Wallace's algorithm [9] is investigated here.

The normalization is accomplished as follows. The N most significant complex
coefficients of the FD (N is typically 32) are denoted as A(-N/2 + 1) through
A(N/2). This frequency domain representation of the contour is normalized by
removing information relating to the relative position of the contour, its size,
its starting point, and its orientation. This is accomplished by three steps:

Step 1. Set A(0) = 0.

        This removes all "DC" positional information.

Step 2. Divide A(i) by |A(1)|, $-N/2 + 1 \le i \le N/2$.

        By definition, A(1) will be the largest coefficient, so this normalizes
        the size of the image such that $|A(i)| \le 1$, for all i.

Step 3. Multiply the A(i)'s by $e^{j[(i-k)u + (1-i)v]/(k-1)}$, where

        k is the index of the coefficient with second largest magnitude (after A(1)), and

        u and v are the phases of A(1) and A(k).

        The required amounts of rotation and starting point shifting give one
        of the normalizations satisfying u = v = 0, which places a major axis
        of the contour [...]

If k = 2 this normalization is unique. Otherwise the phase and starting point of the
normalization must be shifted to account for the |k - 1| - 1 other possible
normalizations. Then the correct normalization must be chosen based on some other
criterion. The criterion examined here chooses the correct normalization as the one
which maximizes

$$\sum_{i=-N/2+1}^{N/2} \mathrm{Re}[A(i)]\,|\mathrm{Re}[A(i)]|$$
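
The three steps and the selection criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coefficients are held in a dictionary keyed by index, the function names `normalize_fd` and `criterion` are hypothetical, the sample coefficients are made up, and only one of the |k - 1| possible normalizations is produced.

```python
import cmath

def normalize_fd(A):
    """Apply steps 1-3 to a dict {index: complex}; returns (normalized dict, k)."""
    A = dict(A)
    A[0] = 0j                                   # step 1: remove the "DC" term
    size = abs(A[1])
    A = {i: a / size for i, a in A.items()}     # step 2: size normalization
    # k = index of the second-largest magnitude (A(1) is assumed to be the largest)
    k = max((i for i in A if i != 1), key=lambda i: abs(A[i]))
    u, v = cmath.phase(A[1]), cmath.phase(A[k])
    # step 3: phase factor that drives the phases of A(1) and A(k) to zero
    B = {i: a * cmath.exp(1j * ((i - k) * u + (1 - i) * v) / (k - 1))
         for i, a in A.items()}
    return B, k

def criterion(A):
    """Sum of Re[A(i)] * |Re[A(i)]| over all coefficients."""
    return sum(a.real * abs(a.real) for a in A.values())

# Made-up coefficients for N = 8, indices -3 .. 4
coeffs = {-3: 0.1 + 0.2j, -2: 0.3 - 0.1j, -1: 0.5 + 0.4j, 0: 2 + 1j,
          1: 3 + 4j, 2: 0.2 - 0.3j, 3: 0.1 + 0.1j, 4: 0.05j}
B, k = normalize_fd(coeffs)
print(k, criterion(B))
```

When k is not 2, the remaining candidates would be generated by further phase and starting point shifts and ranked with `criterion`; that part is left out of the sketch.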

A parallel implementation for the complete FD algorithm will be presented.
The algorithm is divided into distinct tasks, each of which is examined individually.
To achieve further parallelism, the tasks could be pipelined to increase throughput
for real-time applications.


3.  MACHINE  MODELS 

Two models of asynchronous parallel processing and one model of synchronous
parallel processing will be used in the algorithms. The asynchronous models will be
MIMD (Multiple Instruction Stream - Multiple Data Stream) machines; the
synchronous model will be an SIMD (Single Instruction Stream - Multiple Data
Stream) machine.

The organization assumed for an SIMD machine will be a set of P processing
elements (PEs), each a processor with its own memory; a control unit; and an
interconnection network. The control unit broadcasts instructions to all PEs, and
each active PE executes the instruction on the data in its own memory. The
interconnection network allows data to be transferred among the PEs. Examples of
this model are MPP (Massively Parallel Processor) [1] and Siegel's PArtitionable
SIMD/MIMD (PASM) system [5]. An MIMD machine will be assumed to consist of
P processors, M memories and an interconnection network. Each processor can
execute an independent instruction stream. In the Shared Memory MIMD model,
the interconnection network is used to allow all processors access to all of memory.
Examples of this model include the Ultracomputer [4] and C.mmp [3]. In the
Private Memory MIMD model, there is no global store, each processor has a local
memory (M = P), and the interconnection network provides communications among
separate processors. An example of this is PASM [5].

Since communication is a critical part of parallel algorithms, the types of
communications needed by each of the algorithms will be analyzed. This will be a
function of the way in which the data is distributed among the processors/memories.
However, for a given data allocation, the precise communications requirements can
be obtained for SIMD and Private Memory MIMD systems. These will be expressed
in terms of a few common interconnection functions. In SIMD mode, the transfer will
occur for all active processors p, $0 \le p < P$; in a Private Memory
MIMD machine, the transfer will be a request from one PE to obtain data from
another PE's memory. (In SIMD algorithms it is more common to consider
transferring data from a given PE to another PE. In MIMD algorithms, it may be
more natural to consider a data access as a transfer of data from a non-local memory
to a given PE. Since the interconnection functions we are using are symmetric, this
distinction will not matter.) The interconnection functions needed will be:

(1) the class of shift functions, where shift ±d transfers data from PE p to PE
    $(p \pm d) \bmod P$.

(2) the class of cube functions, where if $r = \log_2 P$ and $p_{r-1} \cdots p_i \cdots p_0$ is
    the binary representation of p, $0 \le p < P$, then

        $\mathrm{cube}_i(p) = p_{r-1} \cdots \bar{p}_i \cdots p_0$

    where $\bar{p}_i$ is the complement of $p_i$.
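
As a small illustration of the two function classes, the sketch below computes the destination PE for each. It assumes PEs are numbered 0 to P-1 and that P is a power of 2 for the cube functions; the Python names `shift` and `cube` are illustrative only.

```python
def shift(p, d, P):
    """Destination PE for data leaving PE p under shift +d (pass a negative d for shift -d)."""
    return (p + d) % P

def cube(i, p):
    """Destination PE for data leaving PE p under cube_i: complement bit i of p."""
    return p ^ (1 << i)

P = 8
print([shift(p, 1, P) for p in range(P)])   # [1, 2, 3, 4, 5, 6, 7, 0]
print([cube(1, p) for p in range(P)])       # [2, 3, 0, 1, 6, 7, 4, 5]
```

Applying cube_i twice returns the original PE number, so each cube function pairs the PEs off for an exchange.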

4.  DECOMPOSITION  INTO  PARALLEL  ALGORITHMS 

In this section, parallel algorithms are described for each of the subtasks
required for generating normalized Fourier descriptors. The algorithms are for
conversion from chain code to X-Y coordinates, filtering, resampling, Fourier
transform calculation, and FD normalization.

4.1  Input  Conversion 

It will be assumed that the contour of the image is entered in chain code
representation. Figure 1a shows a typical representation for an 8 nearest neighbor
chain code. The location of point $p_i$ is dependent upon the points $p_0$ through $p_{i-1}$.
The horizontal and vertical segments (0,2,4,6) have length 1; the diagonal segments
(1,3,5,7) have length $\sqrt{2}$.

An example 25-point contour with its chain code representation is given in
Figure 1b. Chain code inputs of practical use contain from a few hundred to a few
thousand points. This is a variable number which depends on the relative size,
shape, and perspective of the object being identified. The number of chain code
inputs will be assumed to be C.

Chain code input is inherently serial since each input is merely an offset from
the previous input. Two parallel algorithms for this normally serial task will be
described. Initially assume that $P = \sqrt{C}$. Thus the C inputs can be divided among
P PEs and each PE will be responsible for $\sqrt{C}$ chain code inputs. This can be
illustrated by forming the input logically into a two dimensional array. The array of
input points will be denoted as CCIn(0..C-1). Figure 2 shows the division of the
contour into segments and arranges each segment into a line of the array. We can
then define parallel operations in which each PE acts on a row or column of this
array.

The first parallel algorithm uses the Shared Memory MIMD model. Initially
each row of the input is processed by a separate PE. This is equivalent to dividing
the contour into P contiguous segments, with each PE responsible for one segment.
Each PE can then assume that it has the "first" segment of the contour and assign
the coordinates (0,0) to the first point. It can then compute the X-Y coordinates of
the rest of its points starting from this reference. Given $\sqrt{C}$ chain code inputs in
each segment, each PE assumes the first point, then generates X-Y coordinates for
$\sqrt{C}$ additional points. Thus, the last point generated in PE p corresponds to the
first point for PE p+1 (the point previously assumed to be (0,0)). X-Y coordinates
for all the input points have now been generated; however, each row of the square
(each segment of the contour) has a different origin in the X-Y plane. Now a
correction step is employed. Denote the X-Y coordinates of input point i as
XY(i), $0 \le i < C$. Since the origin is arbitrary, set it at the point XY(0), that is
XY(0) = (0,0). Then the previous step correctly computed the coordinates of XY(0)
through XY(P). To correct the coordinates of XY(P) through XY(2P) in the second
segment, we must add to each of these the coordinates of XY(P) computed in the
first segment. Subject to memory access constraints, these P+1 corrections can be
done concurrently. Then to correct points XY(2P) through XY(3P) in the third
segment we must add the (newly corrected) XY(2P) from the second segment. All of
the points in segment S can be corrected in parallel; however, segment S must be
corrected before segment S+1, for $1 \le S < P-1$. This correction must be done in
order, for each row of the square.

Fig. 1b. Example 25-point chain code and contour

    CCIn(0)         CCIn(1)      CCIn(2)    ...    CCIn(P-1)
    CCIn(P)         CCIn(P+1)               ...
    CCIn(2P)                                ...
      ...
    CCIn((P-1)P)                            ...    CCIn(P^2-1)

Fig. 2. Division of chain code points

This algorithm can easily be generalized to any number of input points by
assigning $\lceil C/P \rceil$ consecutive points to each of the first P - 1 processors, and
$C - (P-1)\lceil C/P \rceil$ points to the last processor. Some efficiency will be lost if all
processors do not contain the same number of points.

In order to estimate the amount of computation performed, some assumptions
about the number and types of statements will be made. Initially, synchronization
overhead and memory conflicts will not be considered. The basic operations
performed in the parallel algorithm are the conversion from chain code input to X-Y
coordinates based on an arbitrary origin and the correction of the X-Y coordinates to
the correct origin. Assume that the functions CtoX() and CtoY() convert one chain
code input to the proper increment in the X or Y direction. This can be done with a
simple case statement or a conversion table in which each of the 8 possible chain
code inputs maps to the appropriate X and Y increments. Then the coordinates for
XY(i+1) can be obtained from XY(i) and the i-th chain code input by the pair of
statements:

    X(i+1) ← X(i) + CtoX(CCIn(i))

    Y(i+1) ← Y(i) + CtoY(CCIn(i))
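
The conversion table approach can be sketched as follows. The direction numbering is an assumption (0 = +X, codes increasing counter-clockwise in 45 degree steps), since Figure 1a is not reproduced here; it is chosen only so that codes 0, 2, 4, 6 are horizontal or vertical and codes 1, 3, 5, 7 are diagonal. The helper `chain_to_xy` is a hypothetical name for the serial loop.

```python
# Assumed numbering: 0 = +X, then counter-clockwise in 45-degree steps.
DX = [1, 1, 0, -1, -1, -1, 0, 1]
DY = [0, 1, 1, 1, 0, -1, -1, -1]

def CtoX(code):
    """X increment for one chain code input (table lookup)."""
    return DX[code]

def CtoY(code):
    """Y increment for one chain code input (table lookup)."""
    return DY[code]

def chain_to_xy(ccin, origin=(0, 0)):
    """Serial conversion: X(i+1) = X(i) + CtoX(CCIn(i)), and likewise for Y."""
    xs, ys = [origin[0]], [origin[1]]
    for code in ccin:
        xs.append(xs[-1] + CtoX(code))
        ys.append(ys[-1] + CtoY(code))
    return xs, ys

print(chain_to_xy([0, 2, 4, 6]))   # a unit square that closes on itself
```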

This can be considered to be 2 additions and assignments using real arithmetic or
one addition and assignment using complex arithmetic. The serial algorithm consists
of C calls to the conversion functions, C complex additions, and C complex
assignments. The parallel approach executes in the time for $\sqrt{C}$ calls to the
conversion functions, assignments, and complex additions for the initial
conversion, plus $\sqrt{C} - 1$ complex additions and assignments for the correction.
Assuming the dominant operation is the additions, the speedup on computations is
approximately

$$S = \frac{C}{2\sqrt{C} - 1} \approx \frac{C}{2\sqrt{C}} = \frac{\sqrt{C}}{2}.$$

For $P = \sqrt{C}$, $S \approx P/2$.

Consider the memory references required in the above algorithm. If the data is
viewed as a matrix with P data points on a side, each processor operates on a row
and then a column of that matrix. In a parallel system with global memory, the
store is typically divided into several memory units. Optimum efficiency comes
about when each processor is accessing a different memory unit during a given
memory cycle, since each memory unit can deliver only one word per memory cycle.
An obvious way to distribute the data is to put each segment of the contour (row of
the matrix) in a separate memory unit. During the first half of the conversion, each
processor acts on a row, so the memory system operates with ideal efficiency. During
the second half of the conversion, every processor acts on the same row
simultaneously. This creates a large bottleneck at the memory unit containing that
row. Kuck discusses this problem in [Kuc77] and suggests skewed storage techniques
that eliminate these bottlenecks at the cost of more complex address computations
in array accesses. There is an overhead involved in every array access, thus reducing
the speedup.
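
The bottleneck and the effect of skewing can be illustrated with a small count of simultaneous requests per memory unit. The skewing rule used here, element (i, j) stored in unit (i + j) mod P, is one common choice and is an assumption of this sketch; the source does not state which scheme [Kuc77] recommends. All function names are hypothetical.

```python
def unit_by_row(i, j, P):
    """Straightforward layout: row i lives entirely in memory unit i."""
    return i

def unit_skewed(i, j, P):
    """Assumed skewed layout: element (i, j) lives in unit (i + j) mod P."""
    return (i + j) % P

def max_conflicts(accesses, unit, P):
    """Largest number of simultaneous requests that hit a single memory unit."""
    counts = [0] * P
    for i, j in accesses:
        counts[unit(i, j, P)] += 1
    return max(counts)

P = 4
one_per_row = [(p, 2) for p in range(P)]   # first half: PE p reads from its own row
same_row = [(1, p) for p in range(P)]      # second half: all PEs read from one row

for layout in (unit_by_row, unit_skewed):
    print(layout.__name__,
          max_conflicts(one_per_row, layout, P),
          max_conflicts(same_row, layout, P))
```

With the row layout the second access pattern sends all P requests to one unit, while the skewed layout spreads both patterns over P distinct units, at the price of the modular address computation on every access.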

In the SIMD or Private Memory MIMD model, it is assumed that accesses to the
local memory can occur without contention from other processors. Consider
rewriting the algorithm to use only local memories. The initial step of the algorithm
is unchanged: each PE obtains X-Y coordinates for one segment of the contour,
assuming an arbitrary origin. Then recursive doubling is done to produce correction
values for all the segments of the contour at once. Recursive doubling is a method
of computing accumulated sums across processors [8]. To show this in a program
segment, assume a call to rec_dbl(val) uses the value val and takes care of all the
communications to perform the recursive doubling. If val(i) refers to the value of val
in PE i, then for all PEs p, $0 \le p < P$, rec_dbl(val(p)) will return the partial sum:

$$\mathrm{rec\_dbl}(val(p)) = \sum_{i=0}^{p} val(i)$$

An  example  of  recursive  doubling  is  shown  in  Figure  3. 
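
A serial simulation of recursive doubling may help make the partial-sum definition concrete. This is a sketch, not the authors' implementation: the list `vals` stands for val(0) through val(P-1), each pass of the loop models one shift +2^i transfer step, and PEs with p < 2^i are assumed to ignore the wrapped-around value they would receive.

```python
def rec_dbl(vals):
    """Return the list whose entry p is vals[0] + ... + vals[p]."""
    P = len(vals)
    sums = list(vals)
    step = 1
    while step < P:
        old = list(sums)              # all PEs transfer concurrently: read old values
        for p in range(step, P):      # PE p receives from PE p - step and accumulates
            sums[p] = old[p] + old[p - step]
        step *= 2                     # next transfer distance: 1, 2, 4, ...
    return sums

print(rec_dbl([1, 2, 3, 4, 5, 6, 7, 8]))   # [1, 3, 6, 10, 15, 21, 28, 36]
```

For P = 8 the loop body runs three times, matching the log2 P transfer steps quoted in the text.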

For the Private Memory algorithm, the restriction that $P = \sqrt{C}$ is relaxed. The
only assumption made is that there are D input points in each processor's local
memory in the array CCIn(0..D-1). This algorithm stores a segment of the contour
in each PE's local memory. Initially, each PE computes coordinates based on the
assumption that its first point is at the origin (0,0). Knowing the relative
coordinates of the last point in each segment, the absolute coordinates of the
beginning of each segment can be computed as follows. The correction for PE 1 is
given by the coordinates of the last point in PE 0; the correction for PE 2 is given
by the sum of the coordinates of the last points in PEs 0 and 1; in general, the
correction for PE p is the sum of the coordinates of the last points in PEs 0 through
p - 1. Recursive doubling is used to compute all the needed sums simultaneously.

Fig. 3. Recursive doubling example for 8 PEs

Once each PE has the absolute coordinates for its first point, it can correct the rest
of its segment locally. This step will be done concurrently in all PEs. The algorithm
is given in Figure 4.


    /* Local variable definitions */

    P;                /* Number of processing elements */
    D;                /* Number of data points in each processor */
    CCIn(0..D-1);     /* Input chain code for one contour segment */
    X(0..D);          /* X coordinates for this contour segment */
    Y(0..D);          /* Y coordinates for this contour segment */
    sumx;             /* Partial sum of all X coordinates */
    sumy;             /* Partial sum of all Y coordinates */
    i;

    X(0) ← Y(0) ← 0
    sumx ← sumy ← 0

    /* Compute X-Y coordinates for all points */
    FOR i ← 0 THROUGH D-1 DO
    BEGIN
        X(i+1) ← X(i) + CtoX(CCIn(i))
        Y(i+1) ← Y(i) + CtoY(CCIn(i))
    END

    /* Compute correction factors in parallel */
    sumx ← rec_dbl(X(D))    /* log2 P transfer steps */
    sumx ← sumx - X(D)      /* Only consider offset from previous segments */
    sumy ← rec_dbl(Y(D))    /* log2 P transfer steps */
    sumy ← sumy - Y(D)      /* Only consider offset from previous segments */

    /* Correct each segment locally */
    FOR i ← 0 THROUGH D-1 DO
    BEGIN
        X(i) ← X(i) + sumx
        Y(i) ← Y(i) + sumy
    END

Fig. 4. Input conversion algorithm for a Private Memory MIMD machine

Here the number of input points is C = P·D. The overall computational
complexity is proportional to $D + \log_2 P$. If the assumption is kept that $C = P^2$, then
$D = P = \sqrt{C}$, and the complexity is $\sqrt{C} + \log_2 \sqrt{C}$, compared with a complexity
proportional to C for the serial algorithm. The operations in the first part of the
algorithm (i.e., the local chain code to coordinate conversions) are the same as
those used in the serial algorithm, but are performed in C/P steps instead of C. The
remainder of the parallel algorithm is all overhead. The recursive doubling requires
$\log_2 P$ complex additions and assignments. It also requires $\log_2 P$ points of
synchronization. The local correction step takes time proportional to C/P; however,
each operation is simply a complex addition, which takes less time than the original
chain code to X-Y coordinate conversion step. The time is therefore dominated by
the original conversion step. An O(P) speedup is expected; accounting for the extra
steps, an actual speedup of P/2 is conservative.
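
The whole Private Memory scheme can be checked against the serial conversion with a short simulation. This is a sketch under several assumptions: P divides C evenly, the chain code numbering is the counter-clockwise one with 0 = +X, coordinates are held as complex numbers, and `itertools.accumulate` stands in for the recursive doubling transfers. The function names are hypothetical.

```python
from itertools import accumulate

# Assumed numbering: 0 = +X, then counter-clockwise in 45-degree steps.
DX = [1, 1, 0, -1, -1, -1, 0, 1]
DY = [0, 1, 1, 1, 0, -1, -1, -1]

def serial_convert(ccin):
    """Serial reference: one complex coordinate per chain code input."""
    pts = [0j]
    for c in ccin:
        pts.append(pts[-1] + complex(DX[c], DY[c]))
    return pts[:-1]   # C codes give C+1 points; keep the first C (the P*D owned points)

def parallel_convert(ccin, P):
    """Simulate the three phases of the Private Memory algorithm on P PEs."""
    D = len(ccin) // P                              # assumes P divides C
    segments = [ccin[p * D:(p + 1) * D] for p in range(P)]
    # Phase 1 (concurrent in all PEs): coordinates relative to a local origin
    local = []
    for seg in segments:
        xy = [0j]
        for c in seg:
            xy.append(xy[-1] + complex(DX[c], DY[c]))
        local.append(xy)                            # xy[D] is the segment end point
    # Phase 2: inclusive partial sums of the end points (stand-in for rec_dbl)
    ends = [xy[D] for xy in local]
    inclusive = list(accumulate(ends))
    # Phase 3 (concurrent): add the offset contributed by the previous segments only
    out = []
    for p in range(P):
        offset = inclusive[p] - ends[p]
        out.extend(z + offset for z in local[p][:D])
    return out

codes = [0, 1, 2, 3, 4, 5, 6, 7, 0, 0, 2, 2, 4, 4, 6, 6]
print(parallel_convert(codes, 4) == serial_convert(codes))
```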

Summarizing, two algorithms for the input conversion have been presented.
Both methods are fairly regular and could be done on an SIMD machine. The first
method is well suited for a Shared Memory MIMD machine, and the second method
works well with either MIMD machine model. The first method effectively uses
broadcasts (by placing values in memory), while the second method uses shift $+2^i$
functions, $0 \le i < \log_2 P$, for the recursive doubling. Consider representing the
complexity of the algorithm as being proportional to $\alpha C/P + \log_2 P$. The actual
choice of P will in general be made based on speed constraints of the application and
the range of values of C. We would like to estimate the largest "reasonable"
value for P. As an arbitrary measure, if we say that we want the cost of the
computation to dominate (i.e., $\alpha C/P > \log_2 P$) and let $\alpha = 2$, then for small
contours (C = 64), $P_{max} = 16$, and for large contours (C = 2048), $P_{max} = 256$.
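
The two quoted limits can be reproduced with a short search. The sketch assumes that P is restricted to powers of 2, which is what makes the quoted values come out; the function name `p_max` is illustrative.

```python
import math

def p_max(C, alpha=2):
    """Largest power-of-two P for which alpha*C/P > log2(P)."""
    best = 1
    P = 1
    while P <= C:
        if alpha * C / P > math.log2(P):   # computation still dominates
            best = P
        P *= 2
    return best

print(p_max(64), p_max(2048))   # 16 256
```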

4.2  Filtering 

The filtering of the image is an optional step to remove some of the
quantization noise. Typically this is a smoothing operation, in which each point is
replaced by the (possibly weighted) average of itself plus W neighboring points. This
can be done easily in parallel by giving each processor a section of the contour.
Given a filtering window width W, each processor will need to access (W/2)-1 points
from each adjoining section. This could be accomplished by at most W transfer
steps. If a memory system is used where accesses to adjacent memories are allowed,
it is important that "wrap-around" can occur. That is, PE P-1 should be a
"neighbor" to PE 0. For more discussion of the filtering problem in general, see [7].

Overall, in this portion of the algorithm speedups on the order of P can be
expected for small values of W. For large W, the number of accesses to data in
adjacent processors may be significant. Then, properties of the parallel system will
have a greater effect on the total processing time. These properties include methods
of memory accesses and interconnection between processors/memory.

In the filtering step, only shift ±1 communication is needed. The number of
usable PEs is related to the number of points per PE and the width W. Speed
constraints may dictate how many points must be filtered by each PE. Speedup can
be increased by increasing P. On the other hand, for large W, the relative effect of
the transfers can be reduced by decreasing P and thus increasing C/P. Since
filtering is a regular operation, it could be done easily on SIMD as well as MIMD
machines.
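
The section-per-PE filtering with wrap-around can be sketched as below. The assumptions are labeled loudly: the filter is an unweighted average, the window is described by a hypothetical half-width h (h points on each side of the point being smoothed) rather than by W, P divides the number of points, and h does not exceed the section length. Function names are illustrative.

```python
def smooth_serial(points, h):
    """Circular moving average over 2*h + 1 points (reference version)."""
    n = len(points)
    return [sum(points[(i + k) % n] for k in range(-h, h + 1)) / (2 * h + 1)
            for i in range(n)]

def smooth_sections(points, P, h):
    """Same filter, computed section by section as P PEs would."""
    n = len(points)
    D = n // P                                   # assumes P divides n and h <= D
    blocks = [points[p * D:(p + 1) * D] for p in range(P)]
    out = []
    for p in range(P):
        left = blocks[(p - 1) % P][D - h:]       # last h points of the previous PE
        right = blocks[(p + 1) % P][:h]          # first h points of the next PE
        ext = left + blocks[p] + right           # local data plus border points
        out.extend(sum(ext[i:i + 2 * h + 1]) / (2 * h + 1) for i in range(D))
    return out

contour = [complex(i, (i * i) % 7) for i in range(24)]
print(smooth_sections(contour, 4, 2) == smooth_serial(contour, 2))
```

The modular indices `(p - 1) % P` and `(p + 1) % P` are where the wrap-around requirement shows up: PE 0 takes its left border from PE P-1.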

4.3  Resampling 

The input outline needs to be resampled since the Fourier descriptor algorithm
requires equal distances between input samples. From chain code input, the diagonal
segments are longer by a factor of $\sqrt{2}$.

The basic approach to resampling is to compute the length of the entire C-point
contour, then resample it to R evenly spaced points. The length within each PE can
be computed with a speedup of P and the partial and total sums across the PEs can
be computed in $\log_2 P$ steps using recursive doubling. After the total length has been
obtained, the contour is divided into P groups of points such that all groups have
equal length. Each PE will compute the resampled points for its own group. In the
conversion and filtering stages, each PE held the same number of points. Here the
PEs hold equal length groups of points, and the number of original points between
two groups may differ by as much as a factor of $\sqrt{2}$. Once the boundary locations
between groups have been determined, contour points may have to be moved between
adjacent PEs to achieve the division into equal length groups. Once the appropriate
points are collected into a PE, it can compute the resampled points for its own
group. Since each PE's group has the same length, each PE will compute the same
number of resampled points.
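
A serial sketch of the resampling step is given below. It assumes a closed contour stored as complex coordinates, linear interpolation between original points, and a hypothetical function name `resample`; the source does not specify the interpolation technique, and the division of the work among PEs is not modeled.

```python
from itertools import accumulate

def resample(points, R):
    """Resample a closed contour to R points equally spaced in arc length."""
    n = len(points)
    seg = [abs(points[(i + 1) % n] - points[i]) for i in range(n)]
    cum = [0.0] + list(accumulate(seg))      # cum[i] = arc length up to point i
    total = cum[-1]                          # length of the entire contour
    out = []
    i = 0
    for r in range(R):
        s = r * total / R                    # target arc length of sample r
        while cum[i + 1] <= s:               # advance to the segment containing s
            i += 1
        t = (s - cum[i]) / seg[i]            # fractional position inside the segment
        out.append(points[i] + t * (points[(i + 1) % n] - points[i]))
    return out

square = [0j, 1 + 0j, 1 + 1j, 1j]
print(resample(square, 8))
```

In the parallel version described above, the cumulative lengths would come from local sums combined by recursive doubling, and each PE would evaluate only the samples whose target arc lengths fall inside its equal-length group.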

During this resampling, each processor operates primarily on local data. The
only need for non-local data occurs at the ends of the contour segments. The
amount of non-local data required depends on the resampling technique employed.
Since most data is local, memory access is not a problem. During the recursive
doubling, the interconnection network will be used. Thus, any architecture in which
the communications facilities can easily support nearest neighbor (shift ±1) and
recursive doubling (shift $+2^i$) transfers should run this algorithm well.

Although SIMD machines can be used for resampling [11], MIMD execution is
more suitable here because of the possible irregularities in the distances between the
original samples. Either MIMD model should perform well. Again, P is chosen so
that the number of points in each PE is large enough so that the amount of work is
significant compared to the parallel overhead. The range of $P_{max}$ as a function of C
will be approximately the same as for the input conversion algorithm.

4.4  Fourier  Transform


The FD is obtained by computing the first 32 points of the DFT on the R-point
resampled contour. Here an FFT algorithm utilizes the PEs well. Since the number
of contour points may be as large as 2048, but only 32 frequency domain coefficients
are required for the FD, it may seem that a DFT, computing only the 32 coefficients
needed, could provide similar speedups. Unfortunately, the DFT suffers from the
need to broadcast all R points to all PEs, and does not approach the low
computational cost of the FFT for the range of R of interest.
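
To make the comparison concrete, the sketch below computes coefficients both ways on a serial machine: a textbook recursive radix-2 FFT and a direct DFT sum for a single coefficient. Neither is the parallel algorithm discussed in this section, and the function names are illustrative. As a rough serial count, 32 direct coefficients need about 32R complex multiplications (65,536 for R = 2048), against roughly (R/2) log2 R (11,264 for R = 2048) for the full FFT.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor times odd part
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def dft_coeff(x, k):
    """Direct evaluation of a single DFT coefficient."""
    n = len(x)
    return sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))

samples = [complex(m % 5, (m * m) % 3) for m in range(64)]
X = fft(samples)
print(max(abs(X[k] - dft_coeff(samples, k)) for k in range(32)))
```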

Using the parallel FFT algorithms in [...], a radix-2 R-point FFT can be
computed in P processors (P a power of 2) in $\frac{R}{2P}\log_2 R$ complex multiplication
steps, $\frac{R}{P}\log_2 R$ complex addition steps, and [...] transfer steps. If, for example,
P = 32, then each PE will hold R/32 input samples, and the execution is dominated
by $\frac{R}{64}\log_2 R$ complex multiplications and $\frac{R}{32}\log_2 R$ complex additions. In addition,
[...] R transfer steps are needed. These transfer steps represent the overhead of
parallel execution and could account for execution overhead near the time required
for the multiplication and addition steps. Even so, for $R \gg P$, the speedups are no
worse than P/2, thus the asymptotic speedup for this portion of the algorithm is
O(P). In order to accomplish these gains, a communications facility is required to
transfer the data at each point of synchronization. Because of the high degree of
synchronization required, the FFT is best suited for SIMD rather than MIMD
implementation. The data transfers are $\mathrm{cube}_i$ functions, $0 \le i < \log_2 P$, and are
done frequently. An MIMD machine of either type would be slowed by the large
amount of communication and synchronization.

For an R-point radix-2 FFT, as many as R/2 PEs could be used.
However, it will be practical to use the same number of PEs as were used in the
previous algorithm (resampling), so that data reallocation is not needed.

4.5  Descriptor  Normalization

The remaining step normalizes the coefficients by rule so that they can be
compared to a library of contour coefficients. The number of coefficients is N which
is typically 32. Suppose P = N = 32. To normalize the coefficients, A(0) is set to 0
and all values are scaled by |A(1)|. This requires one broadcast and one parallel
division. To find which coefficient is largest, the magnitudes can be computed in
parallel, then the comparison can be performed in $\log_2 P$ (= 5) transfers and
comparisons using recursive doubling. The speedup would be on the order of only
$P/\log_2 P = 6.4$ for this small section. Then depending on parameters of A(1) and
A(k), the starting point and origin are shifted appropriately. This is done once if
k = 2, otherwise it is done |k - 1| times. Speedups can be estimated from the
operations involved in shifting the origin or starting points. Either of these can be
computed easily by multiplying each coefficient by a complex factor. This factor is
the same across all processors for the origin adjustment and it is computed
individually for the starting point adjustment. No communication or
synchronization is needed, so any MIMD system should handle these well. The
speedups then will be $S \approx P = 32$ for these shifting operations.

When more than one of these normalizations are done, the "correct"
normalization is computed as the one with the maximum sum

$$\sum_{i=-N/2+1}^{N/2} \mathrm{Re}[A(i)]\,|\mathrm{Re}[A(i)]|$$

These terms can be computed in parallel with optimal speedup. The sum can then
be formed in $\log_2 N$ steps and compared on a single processor to each of the other
normalizations.

Instead of dealing with a few hundred or a few thousand data points of varying
number, this procedure deals with only N (N = 32) data points. Thus, care must be
observed in estimating speedups since synchronization overhead may be a significant
factor in execution speed. For this part of the processing a SIMD system may be
more efficient. MIMD systems would need to have efficient synchronization
mechanisms to perform well. Overall, however, a somewhat lesser speedup in this
section will not significantly affect the execution time, since the number of data
items has dropped from several hundred to 32, and the complexity of the operations
performed in this step is not high. Since each PE could contain as few as one point
each, $P_{max}$ could be 32.


5.  CONCLUSIONS

The use of a parallel machine could speed up the normalized Fourier descriptor
algorithm significantly. For practical contours, the number of PEs that can be
effectively used is approximately 16 or 32. The types of transfers required are the
$\mathrm{cube}_i$, shift ±1, and shift $+2^i$ functions, for $0 \le i < \log_2 P$. Some sub-algorithms
are better suited to MIMD architectures, others to SIMD architectures. Together,
the collection of algorithms that comprise the FD task demonstrates a variety of
techniques in parallel processing and shows that substantial speedups can be
achieved using parallelism.

Abstract

Most research in interconnection network analysis
has been based on topologically regular (uniformly
structured) networks. As hardware becomes less
expensive, more and more distributed algorithms will be
implemented by special purpose multiprocessor systems.
In this paper, a formal graph/algebraic model of special
purpose (topologically regular and irregular) networks is
presented. These analysis techniques can be used for
(a) system emulation; (b) fault tolerance; and
(c) partitioning of systems.


I.  Introduction

Most research in interconnection network analysis
has been based on topologically regular interconnection
networks such as the ILLIAC [3], Shuffle [11],
multistage Cube [1], single stage Cube [19], STARAN [2],
ADM [13], Mesh [14], and PM2I [18]. As hardware
becomes less expensive, more and more distributed
algorithms will be embedded into special purpose
multiprocessor systems [15, 16, 17]. A system informally
consists of a set of devices, an interconnection network,
and a rule which defines the usage of the network. A
device will be assumed to have two ports: one input and
one output. A typical device might be a
processor/memory pair, a processor only, or a memory
only. Distributed algorithms for multiprocessors may
give rise to special purpose irregular interconnection
networks. Some effective modeling and analytical
methods to study these irregular networks are needed.
Problems that will benefit from precise analytical
methods include:

(1) the use of system A to emulate system B (three
    different degrees of strictness of emulation are
    discussed);

(2) fault tolerance/reliability (achieved by multiple
    mappings of the same problem into a system); and

(3) partitioning of a system.

Some work has been done on these problems for regular
interconnection networks, for example for (1) "quotient
networks" [5] and for (3) partitioning theory [20, 21].

The  methods  developed  here  will  allow  a  well 
defined  comparison  between  topologies  of  systems.  For 
example  if  system  A  is  related  to  system  B,  and  system 
B  is  related  to  system  C,  then  it  may  be  possible  to  say 
something  about  the  relationship  of  system  A  to  system 
C.  The  similarity  measures  are  of  three  basic  types, 
with  each  one  stricter  than  the  previous  one. 

The material is presented as follows: after each
major definition or theorem a brief example of its
application is given. In this paper it will be assumed that
the reader is familiar with basic graph theory [4, 9] and
basic abstract algebra [8, 10].

In section II some basic concepts are defined. The
model of interconnection networks to be used is given in
section III. In section IV the definitions of a system and
three types of subsystems are presented and their
properties analyzed. The concept of a "quasimorphism" is
explained in section V. Its usage in analyzing the
emulation and other problems is exemplified. Finally, in
section VI, the global conclusions of this paper are
discussed.


II. Basic Definitions

In  this  section,  basic  definitions  needed  as  back¬ 
ground  for  the  rest  of  the  paper  are  given.  A  general 
model  of  an  interconnection  network  is  shown  in  Fig.  1. 
$C = \{C_0, C_1\}$

$C_0 = \{(A,D),(B,E)\}$

$C_1 = \{(A,E),(C,D)\}$

$V_I = \{A,B,C\}$, $V_O = \{D,E\}$

Fig. 1. General model of an interconnection network.
Definition 2.1:

Let $V_I$ be the set of input labels of a network, and
let $V_O$ be the set of output labels of a network
such that:

$V_I \cap V_O = \emptyset$, $V_I \neq \emptyset$, $V_O \neq \emptyset$, where $\emptyset$ is
the empty set.

Then $C_m \subseteq V_I \times V_O = \{(v_a,v_b) \mid v_a \in V_I, v_b \in V_O\}$
is called the I/O correspondence of $V_I$ with $V_O$.
(Physically, $C_m$ represents one state of a
reconfigurable network.)

Definition 2.2:

Let $C_m \subseteq V_I \times V_O$ be an I/O correspondence, then
$S(C_m) \triangleq \{v_a \mid (v_a,v_b) \in C_m\}$ is called the source
set of $C_m$.

Definition 2.3:

Let $C_m \subseteq V_I \times V_O$ be an I/O correspondence, then
$D(C_m) \triangleq \{v_b \mid (v_a,v_b) \in C_m\}$ is called the
destination set of $C_m$.

Definition 2.4:

Let $\{C_m\}$ be a set of I/O correspondences, then
$S(\{C_m\}) \triangleq \bigcup_m S(C_m)$.

Definition 2.5:

Let $\{C_m\}$ be a set of I/O correspondences, then
$D(\{C_m\}) \triangleq \bigcup_m D(C_m)$.

Definition 2.6:

Let $C_m \subseteq V_I \times V_O$ be an I/O correspondence.

If $\forall v_b$, $\forall (v_a,v_b), (v_c,v_b) \in C_m$: $v_a = v_c$, then the
correspondence has the property of
nondestructivity.


Definition 2.7:

Let $A$ be a set, then $P(A) \triangleq \{S \mid S \subseteq A\}$ is the
power set.

Definition 2.8:

Let $\phi$ be a map from $A$ to $B$. Let $E \subseteq A$, then
$\phi(E) \triangleq \{b \in B \mid \phi(a) = b, a \in E\}$ is the image of $E$
under $\phi$.
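
The set-valued definitions of this section translate directly into set operations. The following Python sketch is illustrative only: it assumes an I/O correspondence is stored as a set of (input, output) label pairs, uses the two correspondences of Fig. 1 as data, and reads nondestructivity as "no output label is paired with two different inputs in the same state". All function names are hypothetical.

```python
# Sketch of Definitions 2.2-2.8; every name here is illustrative.
from itertools import chain, combinations

C0 = {("A", "D"), ("B", "E")}   # one network state from Fig. 1
C1 = {("A", "E"), ("C", "D")}   # the other network state from Fig. 1

def source_set(c):
    """S(C_m): inputs that appear in some pair of the correspondence."""
    return {a for a, _ in c}

def destination_set(c):
    """D(C_m): outputs that appear in some pair of the correspondence."""
    return {b for _, b in c}

def source_of_family(cs):
    """S({C_m}): union of the source sets of all correspondences."""
    return set().union(*(source_set(c) for c in cs))

def destination_of_family(cs):
    """D({C_m}): union of the destination sets of all correspondences."""
    return set().union(*(destination_set(c) for c in cs))

def is_nondestructive(c):
    """Assumed reading: each output is paired with at most one input."""
    outputs = [b for _, b in c]
    return len(outputs) == len(set(outputs))

def power_set(a):
    """P(A): the set of all subsets of A, as frozensets."""
    items = list(a)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    return {frozenset(s) for s in subsets}

def image(phi, e):
    """phi(E) for a map phi stored as a dict and a subset E of its domain."""
    return {phi[x] for x in e}
```

For the Fig. 1 data, `source_of_family([C0, C1])` yields the input labels A, B, C and `destination_of_family([C0, C1])` yields D, E.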

III. Interconnection Network Model

In  this  section,  a  formal  graph/algebraic  model  of 
an  interconnection  network  is  presented.  This  model 
will  be  used  to  define  a  system  in  section  IV. 

Graph models for analyzing networks have been
used by other researchers. For example, in [6, 7, 12, 22]
they are used to analyze Banyan networks, and in [5]
they are used to study the partitioning of regular
networks. The model presented here differs from [6, 7, 12,
22] and [5] by being completely general so that it can be
used to describe an arbitrary (including topologically
irregular) interconnection network.

Definition 3.1:

I/O representation of network.

(1) $V_I$ - set of input vertices

(2) $V_O$ - set of output vertices

(3) $C$ - set of I/O correspondences $C_m$, where
$C_m \subseteq V_I \times V_O$

(4) $\forall C_m \in C$, $C_m$ has the property of
nondestructivity

(5) $S(C) = V_I$

(6) $D(C) = V_O$

"K" will be used to denote a network.

Notation: $K = (C) = (\{C_m\}) = (\{\{(v_a,v_b)\}\})$

(the notation $\{(v_a,v_b)\}$ indicates a set of one or
more pairs of vertices).

Physical implications: $(v_a,v_b) \in C_m$ represents the network
moving data from input $v_a$ to output $v_b$ when the state
of the network is $C_m$. $C$ represents the set of all
possible states of the reconfigurable network.
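
As a minimal sketch of how conditions (1) through (6) of Definition 3.1 could be checked mechanically, the function below validates a candidate network. It assumes vertex sets are Python sets of labels and each state is a set of (input, output) pairs; the name `is_network` is hypothetical, and nondestructivity is again read as "each output has at most one input per state".

```python
# Illustrative validity check for the I/O representation of a network.

def is_network(v_in, v_out, states):
    """Return True if (v_in, v_out, states) satisfies Definition 3.1."""
    # Input and output label sets must be non-empty and disjoint (Def. 2.1).
    if not v_in or not v_out or (v_in & v_out):
        return False
    for c in states:
        # (3) every state is a subset of V_I x V_O
        if any(a not in v_in or b not in v_out for a, b in c):
            return False
        # (4) assumed nondestructivity: no output driven by two inputs
        outputs = [b for _, b in c]
        if len(outputs) != len(set(outputs)):
            return False
    # (5) S(C) = V_I and (6) D(C) = V_O
    sources = {a for c in states for a, _ in c}
    destinations = {b for c in states for _, b in c}
    return sources == v_in and destinations == v_out

# The network of Fig. 1.
FIG1_STATES = [{("A", "D"), ("B", "E")}, {("A", "E"), ("C", "D")}]
FIG1_OK = is_network({"A", "B", "C"}, {"D", "E"}, FIG1_STATES)
```

Adding an unused input label to `v_in` makes the check fail, because condition (5) requires every input vertex to appear in at least one state.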

IV. Systems and Subsystems

In  this  section,  formal  graph/algebraic  definitions 
of  a  system  and  three  types  of  subsystems  are  discussed. 
Also  shown  are  basic  properties  of  the  three  types  of 
subsystems.  Some  theorems  about  subsystems  are 
presented  and  brief  examples  of  their  applications  are 
given. 

The mathematical definition of system given in this
section can be used to model the following object. It
can be interpreted as a parallel computer system, where
the vertex $v_a \in V_I$ corresponds to a device output, $v_b \in V_O$
corresponds to a device input and $C_m$ to a state of a
physical network.
Definition 4.1:

$C_F$ - feedback correspondence. Let
$K = (\{\{(v_a,v_b)\}\})$ be a network. If the usage of
the network is such that data output at $v_y \in V_O$
can be fed back in $v_x \in V_I$, then $(v_x,v_y) \in C_F$.

Physical implications: This describes the situation where
a processor or any other device is connected to both $V_O$
and $V_I$. The device inputs data into $v_x \in V_I$ and receives
data at $v_y \in V_O$. Thus if $(v_x,v_y) \in C_F$ then the same
device is attached to $v_x$ and $v_y$. If $(v_x,v_y) \notin C_F$ then a
separate device is attached to each of $v_x$ and $v_y$. Since
it is assumed that each device has only one input and
one output, and that a vertex can have at most one
device connected to it, $C_F$ has the following properties:

(a) if $(v_x,v_y), (v_z,v_y) \in C_F$ then $v_x = v_z$;

(b) if $(v_x,v_y), (v_x,v_z) \in C_F$ then $v_y = v_z$;

(c) $C_F \subseteq V_I \times V_O$.

Theorem 4.2:

$C_F$ is a map, 1:1, onto from $X$ to $Y$, where $X \subseteq V_I$
and $Y \subseteq V_O$.

Proof:

Obvious by definition of $C_F$ and properties (a),
(b), (c).

□

Definition 4.3:

System. Let $K = (C) = (\{C_m\})$ be a network,
with $V_I$ and $V_O$, and let $C_F$ be a feedback
correspondence ($C_F \subseteq V_I \times V_O$), then $S = (C, C_F) =
(\{C_m\}, C_F)$ is called the system.

Physical implications: The $C_F$ precisely describes the
usage of a network in a system. If $S(C_F) = V_I$ and
$D(C_F) = V_O$, then the system is fully recirculating. If
$C_F \neq \emptyset$ and either $S(C_F) \neq V_I$ or $D(C_F) \neq V_O$ (or both),
then the system is partially recirculating. If $C_F = \emptyset$
then the system is nonrecirculating.
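
The three recirculation classes can be told apart with two set comparisons. The sketch below is illustrative and assumes the feedback correspondence is a set of (input, output) pairs; `is_one_to_one` checks properties (a) and (b) of the feedback correspondence, and both function names are hypothetical.

```python
# Illustrative classification of a system by its feedback correspondence.

def is_one_to_one(cf):
    """Properties (a) and (b): no input and no output appears twice in C_F."""
    inputs = [a for a, _ in cf]
    outputs = [b for _, b in cf]
    return len(set(inputs)) == len(inputs) and len(set(outputs)) == len(outputs)

def classify(v_in, v_out, cf):
    """Return the recirculation class of a system with feedback set cf."""
    if not cf:
        return "nonrecirculating"
    fed_inputs = {a for a, _ in cf}     # S(C_F)
    fed_outputs = {b for _, b in cf}    # D(C_F)
    if fed_inputs == v_in and fed_outputs == v_out:
        return "fully recirculating"
    return "partially recirculating"

V_IN = {"v0", "v1"}
V_OUT = {"u0", "u1"}
FULL = classify(V_IN, V_OUT, {("v0", "u0"), ("v1", "u1")})
PARTIAL = classify(V_IN, V_OUT, {("v0", "u0")})
NONE = classify(V_IN, V_OUT, set())
```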

An example of a system is given in Fig. 2.


Fig.  2.  Example  of  a  system. 


Definition 4.4:

Equality of systems. Let $S^{(1)} = (C^{(1)}, C_F^{(1)})$ and
$S^{(2)} = (C^{(2)}, C_F^{(2)})$ be two systems. If (1)
$V_I^{(1)} = V_I^{(2)}$, $V_O^{(1)} = V_O^{(2)}$; (2) $C_F^{(1)} = C_F^{(2)}$; and (3)
$C^{(1)} = C^{(2)}$, then $S^{(1)}$ is equal to $S^{(2)}$.

Notation: $S^{(1)} = S^{(2)}$.

Physical implication: $S^{(1)}$ and $S^{(2)}$ are completely
interchangeable.

Theorem 4.5:

Sufficiency condition for equality of systems. If (3)
holds in Def. 4.4 then (1) holds.

Proof:

(a) Show: (3) $\to V_I^{(1)} = V_I^{(2)}$.

$V_I^{(1)} = S(C^{(1)}) = S(C^{(2)}) = V_I^{(2)}$.

(b) Show: (3) $\to V_O^{(1)} = V_O^{(2)}$.

$V_O^{(1)} = D(C^{(1)}) = D(C^{(2)}) = V_O^{(2)}$.

□

The implication of this theorem is that to check two
systems for equality it is only necessary to examine $C_F$
and $C$.

The  definitions  4.6,  4.7,  and  4.8  describe  three 
different  types  of  subsystems:  a,  b,  and  c.  They  are 
presented  in  order  of  increasing  strictness. 

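Definition 4.6:

Subsystem type a. Let $S^{(1)} = (C^{(1)}, C_F^{(1)})$ and
$S^{(2)} = (C^{(2)}, C_F^{(2)})$ be two systems.

If (1) $V_I^{(1)} \subseteq V_I^{(2)}$, $V_O^{(1)} \subseteq V_O^{(2)}$;

(2) $C_F^{(1)} \subseteq C_F^{(2)}$; and

(3) $\forall C_m^{(1)} \in C^{(1)}$
$\exists\, C_n^{(2)} \in C^{(2)} \ni C_m^{(1)} \subseteq C_n^{(2)} \cup C_F^{(2)}$,

then $S^{(1)}$ is subsystem type a of $S^{(2)}$. ("$\ni$" means
"such that")

Notation: $S^{(1)} \subseteq_a S^{(2)}$.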

Example  of  subsystem  type  a. 

Let 

S<*l  =  (C'*l,  C/-*’)  be  a  system. 

=  {vo.v,.Vj} 

=  {uo.  u,,  uj( 

=  ((Vj,  Uo).(v,,  U,),(V2,  Uj)) 

C*'  =  (Ci''''.  Cf ,  Cf } 

=  ((v^,  Uo).(Vj,  U,),(V2,  Uj)) 

(•P’  =  ((v,,  U,),(v,,  llj)} 

=  U''l-  u?))- 

Let

V|"  =  {Vo  V,) 

Vy>  =  {Uo.  u,} 

Cf"  =  {(Vo,  Uo).(v,,  u,)) 

C">  =  {Cj'’,  C|"| 

Ci'>  =  {(Vo,  Uo),(v,,  u,)) 

C|"  =  {(Vo,  u,)}. 
Then (1) $(C^{(1)}, C_F^{(1)})$ is a system (denoted $S^{(1)}$).

(2)  (a)  V/'>  C  V/»l,  Vi"  C  Vi*) 

(b)  Ci"Ccj?) 

(c)  ci"  C  c^*)  C  cfi  u  ci*)  . 
cj"  c  C'i*)  c  c]*)  u  ci*) 

-»Sl‘)  CaSW. 

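Definition 4.7:

Subsystem type b. Let $S^{(1)} = (C^{(1)}, C_F^{(1)})$ and
$S^{(2)} = (C^{(2)}, C_F^{(2)})$ be two systems.

If (1) $V_I^{(1)} \subseteq V_I^{(2)}$, $V_O^{(1)} \subseteq V_O^{(2)}$;

(2) $C_F^{(1)} \subseteq C_F^{(2)}$; and

(3) $\forall C_m^{(1)} \in C^{(1)}\ \exists\, C_n^{(2)} \in C^{(2)} \ni C_m^{(1)} \subseteq C_n^{(2)}$,

then $S^{(1)}$ is subsystem type b of $S^{(2)}$.

Notation: $S^{(1)} \subseteq_b S^{(2)}$.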

Example  of  subsystem  type  b. 

Let 

sl*)  =  (C**),  C'i*))  be  a  system. 

=  <Vo,  V,.  vj} 

Vi*’  =  {uo,  u,.  u,} 

=  {(''0.  Uo).(Vl.  U|).(V2,  Uj)} 

Cl*)  =  (Ci*),  c}*).  Ci*)} 

C’i*’  =  {Ivo.  Uo).l''o.  U|)>l»2.  uj)} 

L'l*'  =  {{v„  u,),(v„  uj)} 

C.’i*)  =  {(vj,  u,).(v2,  Uj)}. 

Let

V/"  =  {vo.v,} 

Vi"  =  {uo,u,} 

=  {(V  Uo).(V|.  “l)} 

C")  =  (C’i",  C|’)} 

C’i')  =  {(Vo,  Uo)} 

Cl”  =  {(Vo,  Uo),(Vo,  u,)}. 

Then  (1)  (C'l”,  Cf”)  is  a  system  (denoted  S*”). 

(2)  (a)  Vj”  C  Vj*),  Vi"  C  Vi*) 

(b)  C^"  C  Cj.*) 

(c)  Cj”  C  Cj*)  ,  C|”  C  Cj*) 

-*  S”)  CbSl*). 

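Definition 4.8:

Subsystem type c. Let $S^{(1)} = (C^{(1)}, C_F^{(1)})$ and
$S^{(2)} = (C^{(2)}, C_F^{(2)})$ be two systems.

If (1) $V_I^{(1)} \subseteq V_I^{(2)}$, $V_O^{(1)} \subseteq V_O^{(2)}$;

(2) $C_F^{(1)} \subseteq C_F^{(2)}$; and

(3) $\forall C_m^{(1)} \in C^{(1)}\ \exists\, C_n^{(2)} \in C^{(2)} \ni C_m^{(1)} = C_n^{(2)}$,

then $S^{(1)}$ is subsystem type c of $S^{(2)}$.

Notation: $S^{(1)} \subseteq_c S^{(2)}$.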

Example  of  subsystem  type  e. 

Let 

Sl*)  =  (d*),  cj*))  be  a  system. 

Vj*’  =  {vo,  v„  V,) 

Vi*)  =  (uo,  u„  Uj) 

Cj*’  =  {(vo.  Uo),(v„  u,),(v,,  Uj)} 


C<*)  =  (Cj*).  C|*),  Cj*)} 

Cj*)  =  {(Vj,  Uo),(Vo.  U,),(V2,  Uj)} 

Cj*)  =  {{v„  u,),(v„  Uj)} 

Cj*’  =  {(v,.  u,),(vj,  uj)}. 

Let 

Vj"  =  {v„  Vj} 

Vi”  =  {u,.  Uj) 

Cj"  =  {(v„  u,)} 
d”  =  (Cj".  Cj”} 

Cj"  =  {(V„  u,),(v„  Uj)} 

C|”  =  {(Vj,  u,),(vj,  Uj)}. 

Then  (I)  (C’",  Cj”)  is  a  system  (denoted  Sl”). 

(2)  (a)  Vj”  C  Vj*),  Vi”  C  Vi*) 

(b)  cj”  C  Cj*) 

(c)  Cj”  =  Cj*) ,  Cj”  =  cj*) 

-*  S”)  Cc  S<*). 
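
The three subsystem definitions differ only in condition (3), which suggests a single checker parameterized by type. The sketch below is a hedged illustration, not the paper's procedure: a system is assumed to be a pair (list of states, feedback set), vertex sets are derived from the states via $S(C)$ and $D(C)$, and the small example systems are hypothetical data chosen for this sketch rather than the examples printed above.

```python
# Illustrative checker for subsystem types a, b and c (Defs. 4.6-4.8).

def vertex_sets(states):
    """Derive V_I = S(C) and V_O = D(C) from the list of states."""
    v_in = {a for c in states for a, _ in c}
    v_out = {b for c in states for _, b in c}
    return v_in, v_out

def is_subsystem(s1, s2, kind):
    """Return True if s1 is a subsystem of s2 of the given kind."""
    states1, cf1 = s1
    states2, cf2 = s2
    vi1, vo1 = vertex_sets(states1)
    vi2, vo2 = vertex_sets(states2)
    if not (vi1 <= vi2 and vo1 <= vo2):      # condition (1)
        return False
    if not cf1 <= cf2:                       # condition (2)
        return False

    def matches(c1, c2):                     # condition (3), per type
        if kind == "a":
            return c1 <= (c2 | cf2)
        if kind == "b":
            return c1 <= c2
        if kind == "c":
            return c1 == c2
        raise ValueError(f"unknown subsystem type: {kind}")

    return all(any(matches(c1, c2) for c2 in states2) for c1 in states1)

# Hypothetical data for illustration only.
S2 = ([{("v0", "u0"), ("v1", "u1")}, {("v0", "u1")}], {("v0", "u0")})
S1_C = ([{("v0", "u1")}], set())                  # equals a state of S2
S1_B = ([{("v0", "u0")}], {("v0", "u0")})         # proper subset of a state
S1_A = ([{("v0", "u0"), ("v0", "u1")}], set())    # needs the feedback pair
```

With this data, `S1_C` passes all three checks, `S1_B` passes a and b but not c, and `S1_A` passes only a, which mirrors the ordering of strictness stated before Definition 4.6.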

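Theorem 4.9:

Sufficiency condition for subsystem type a.

If (2) and (3) hold in Def. 4.6 then (1) holds.

Proof:

(a) Show: (2), (3) $\to V_I^{(1)} \subseteq V_I^{(2)}$.

$V_I^{(1)} = S(C^{(1)}) = S(\bigcup_m C_m^{(1)})
\subseteq S(\bigcup_m (C_n^{(2)} \cup C_F^{(2)}))$,

since $S^{(1)} \subseteq_a S^{(2)} \to \forall C_m^{(1)} \in C^{(1)}$
$\exists\, C_n^{(2)} \in C^{(2)} \ni C_m^{(1)} \subseteq C_n^{(2)} \cup C_F^{(2)}$.

$S(\bigcup_m (C_n^{(2)} \cup C_F^{(2)})) \subseteq S(\bigcup_n (C_n^{(2)} \cup C_F^{(2)}))$,
where the union on the right is taken over all $C_n^{(2)} \in C^{(2)}$.

$S(\bigcup_n (C_n^{(2)} \cup C_F^{(2)}))
= S(\bigcup_n C_n^{(2)}) \cup S(C_F^{(2)}) = V_I^{(2)}$.

Therefore $V_I^{(1)} \subseteq V_I^{(2)}$.

(b) Show: (2), (3) $\to V_O^{(1)} \subseteq V_O^{(2)}$.

Similar to (a), with $S(C^{(1)})$ and $S(C^{(2)})$
replaced by $D(C^{(1)})$ and $D(C^{(2)})$, respectively.

□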

Theorem 4.10:

Sufficiency condition for subsystem type b.

If (3) holds in Def. 4.7 then (1) holds.

Proof:

Analogous to proof of Thm. 4.9 (note that (2) is
not needed since $C_F$ is not part of (3)).

□

Theorem 4.11:

Sufficiency condition for subsystem type c.

If (3) holds in Def. 4.8 then (1) holds.

Proof:

Analogous to proof of Thm. 4.10.

□
Theorem 4.12:

Let $S^{(1)} = (C^{(1)}, C_F^{(1)})$ and $S^{(2)} = (C^{(2)}, C_F^{(2)})$ be
two systems.

(1) If $S^{(1)} = S^{(2)}$ then $S^{(1)} \subseteq_c S^{(2)}$.

(2) If $S^{(1)} \subseteq_c S^{(2)}$ then $S^{(1)} \subseteq_b S^{(2)}$.

(3) If $S^{(1)} \subseteq_b S^{(2)}$ then $S^{(1)} \subseteq_a S^{(2)}$.

Proof:

Obvious, follows from definitions of subsystems.

□

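Theorem 4.13:

Let $S^{(1)} = (C^{(1)}, C_F^{(1)})$ and $S^{(2)} = (C^{(2)}, C_F^{(2)})$ be two
systems.

If (1) $S^{(1)} \subseteq_c S^{(2)}$ and (2) $S^{(2)} \subseteq_c S^{(1)}$, then
$S^{(1)} = S^{(2)}$.

Proof:

Show: (1) $V_I^{(1)} = V_I^{(2)}$, $V_O^{(1)} = V_O^{(2)}$; (2) $C_F^{(1)} = C_F^{(2)}$;
and (3) $C^{(1)} = C^{(2)}$.

Remark: From Thm. 4.11 it is known that (3) $\to$
(1), so only (2) and (3) have to be shown.

(a) Show: $C_F^{(1)} = C_F^{(2)}$.

$S^{(1)} \subseteq_c S^{(2)} \to C_F^{(1)} \subseteq C_F^{(2)}$.

$S^{(2)} \subseteq_c S^{(1)} \to C_F^{(2)} \subseteq C_F^{(1)}$. $\to C_F^{(1)} = C_F^{(2)}$.

(b) Show: $C^{(1)} = C^{(2)}$.

$\forall C_m^{(1)} \in C^{(1)}$
$\exists$ unique $C_n^{(2)} \in C^{(2)} \ni C_m^{(1)} = C_n^{(2)}$.

Similarly $\forall C_n^{(2)} \in C^{(2)}$
$\exists$ unique $C_m^{(1)} \in C^{(1)} \ni C_n^{(2)} = C_m^{(1)}$.

$\to C^{(1)} = C^{(2)}$.

□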

V. Quasimorphism

In this section the main results are presented. A
new similarity measure between systems is defined that
allows a comparison between arbitrary (regular and
irregular) systems. This measure is called quasimorphism
and is completely specified by two mappings
called $\phi_I$ and $\phi_O$. The quasimorphism will facilitate the
analysis of the following problems in parallel processing:

(a)  system  A  emulating  system  B  (three  different 
degrees  of  strictness  of  emulation  are  discussed); 

(b) fault tolerance/reliability (achieved by multiple
mappings of the same problem into a system);

(c)  partitioning  of  a  system. 

The  concept  of  quasimorphism  provides  an  analytical 
method  to  study  network  properties  that  are  implemen¬ 
tation  independent,  such  as  emulation  and  partitioning. 
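Definition 5.1:

Quasimorphism type (a,b,c): where (a,b,c) means
one of a or b or c.

Let $S^{(1)} = (C^{(1)}, C_F^{(1)}) = (\{C_m^{(1)}\}, C_F^{(1)})$
$= (\{\{(v_a,v_b)\}\}, \{(v_x,v_y)\})$; and

$S^{(2)} = (C^{(2)}, C_F^{(2)}) = (\{C_n^{(2)}\}, C_F^{(2)})$
$= (\{\{(w_a,w_b)\}\}, \{(w_x,w_y)\})$ be two systems.

If $\exists\, \psi = (\phi_I, \phi_O)$ such that

(1) $\phi_I: V_I^{(1)} \to V_I^{(2)}$ is a map

(2) $\phi_O: V_O^{(1)} \to V_O^{(2)}$ is a map

(3) $\psi: \{S\} \to \{S\}$ is a map such that:

$\psi(S^{(1)}) = \psi((\{\{(v_a,v_b)\}\}, \{(v_x,v_y)\}))$
$\triangleq (\{\{(\phi_I(v_a),\phi_O(v_b))\}\}, \{(\phi_I(v_x),\phi_O(v_y))\})$
$\subseteq_{(a,b,c)} S^{(2)}$,

then $\psi = (\phi_I, \phi_O)$ is a quasimorphism type (a,b,c)
from $S^{(1)}$ to $S^{(2)}$.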

Physical implications of quasimorphism: Given two
systems with arbitrary vertex descriptions, if there exists
$\psi$ type (a,b,c), that is, a $\phi_I$ and $\phi_O$ with the proper
constraints from $S^{(1)}$ to $S^{(2)}$, then $S^{(1)}$ and $S^{(2)}$ are similar in
a topological sense. The loosest similarity is $\psi$ type a.
The strictest similarity is $\psi$ type c. The $\psi = (\phi_I, \phi_O)$
precisely describes how to handle the following
problems: (1) emulation of systems; (2) identifying
equivalent systems; and (3) partitioning of a network.

Additional auxiliary maps based upon $\phi_I$ and $\phi_O$
are defined to facilitate later analyses.

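Definition 5.2:

$\phi_{I,O}$ map.

Let $S^{(1)} = (C^{(1)}, C_F^{(1)})$; and $S^{(2)} = (C^{(2)}, C_F^{(2)})$ be two
systems.

Let $\phi_I: V_I^{(1)} \to V_I^{(2)}$ be a map; $\phi_O: V_O^{(1)} \to V_O^{(2)}$
be a map.

Define: $\phi_{I,O}: V_I^{(1)} \times V_O^{(1)} \to V_I^{(2)} \times V_O^{(2)}$

$\phi_{I,O}((v_a,v_b)) \triangleq (\phi_I(v_a), \phi_O(v_b))$.

Definition 5.3:

$\mu$ map.

Let $S^{(1)} = (C^{(1)}, C_F^{(1)})$; and $S^{(2)} = (C^{(2)}, C_F^{(2)})$ be
two systems.

Let $\phi_{I,O}: V_I^{(1)} \times V_O^{(1)} \to V_I^{(2)} \times V_O^{(2)}$ be a map.

Define: $\mu: P(V_I^{(1)} \times V_O^{(1)}) \to P(V_I^{(2)} \times V_O^{(2)})$

$\mu(\{(v_a,v_b)\}) \triangleq \{\phi_{I,O}((v_a,v_b))\}$.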

Lemma 5.4:

If $\phi_I$, $\phi_O$ are 1:1 maps then $\phi_{I,O}$ is 1:1 map.

Proof:

Follows from definition.

□

Lemma 5.5:

If $\phi_{I,O}$ is 1:1 map then $\mu$ is 1:1 map.

Proof:

Follows from definition.

□
Definition 5.6:

Alternate notation for quasimorphism.

Let $S^{(1)} = (C^{(1)}, C_F^{(1)}) = (\{\{(v_a,v_b)\}\}, \{(v_x,v_y)\})$ be
a system.

Then: $\psi(S^{(1)}) = \psi((\{\{(v_a,v_b)\}\}, \{(v_x,v_y)\}))$

$= (\{\mu(\{(v_a,v_b)\})\}, \mu(\{(v_x,v_y)\}))$

$= (\{\{\phi_{I,O}((v_a,v_b))\}\}, \{\phi_{I,O}((v_x,v_y))\})$.

Lemma 5.7:

Let $S^{(1)} = (\{\{(v_a,v_b)\}\}, \{(v_x,v_y)\})$ be a system.

Let $\mu: P(V_I^{(1)} \times V_O^{(1)}) \to P(V_I^{(2)} \times V_O^{(2)})$.

Let $\psi: \{S\} \to \{S\}$ be a quasimorphism.

$\psi(S^{(1)}) = (\{\mu(\{(v_a,v_b)\})\}, \mu(\{(v_x,v_y)\}))$.

If $\mu$ is 1:1 map then $\psi$ is 1:1 quasimorphism.

Proof:

Straightforward, but tedious.

□

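Theorem 5.8:

Let $\psi = (\phi_I, \phi_O)$ be a quasimorphism.

If $\phi_I$ and $\phi_O$ are 1:1 maps then $\psi$ is 1:1
quasimorphism.

Proof:

(1) $\phi_I$, $\phi_O$ 1:1 $\to$ $\phi_{I,O}$ 1:1 (Lemma 5.4).

(2) $\phi_{I,O}$ 1:1 $\to$ $\mu$ 1:1 (Lemma 5.5).

(3) $\mu$ 1:1 $\to$ $\psi$ 1:1 (Lemma 5.7).

□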
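
To make the maps of Definitions 5.2, 5.3 and 5.6 concrete, the following sketch applies a pair of vertex maps to a whole system. It is illustrative only: the maps are assumed to be Python dicts, a system is assumed to be a pair (list of states, feedback set), and the names `mu`, `apply_psi` and `is_injective`, as well as the example data, are hypothetical.

```python
# Illustrative application of psi = (phi_I, phi_O) to a system.

def mu(c, phi_i, phi_o):
    """Apply phi_{I,O} to every pair of a correspondence (Def. 5.3)."""
    return {(phi_i[a], phi_o[b]) for a, b in c}

def apply_psi(system, phi_i, phi_o):
    """psi(S): map every state and the feedback set through mu (Def. 5.6)."""
    states, cf = system
    return [mu(c, phi_i, phi_o) for c in states], mu(cf, phi_i, phi_o)

def is_injective(phi):
    """A dict-encoded map is 1:1 when no two keys share a value."""
    return len(set(phi.values())) == len(phi)

# Hypothetical example system and maps.
S1 = ([{("v_a", "u_c")}, {("v_a", "u_d"), ("v_b", "u_c")}], {("v_a", "u_c")})
PHI_I = {"v_a": "w_a", "v_b": "w_b"}
PHI_O = {"u_c": "x_c", "u_d": "x_d"}

IMAGE_STATES, IMAGE_CF = apply_psi(S1, PHI_I, PHI_O)
```

Because both dicts are injective here, distinct pairs of `S1` stay distinct in the image, which is the situation Theorem 5.8 describes; checking whether the image is a subsystem of some target system is a separate step.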

Physical implications: Suppose there is a $\psi^{(1)}$ such that
$\psi^{(1)}(S^{(1)}) = S^{(2)}$ and $\psi^{(1)}(S^{(3)}) = S^{(2)}$ and $\psi^{(1)}$ is 1:1. First,
this means that $S^{(1)} = S^{(3)}$ since $\psi^{(1)}$ is 1:1. Second, and
more important from an engineering point of view, the
1:1 property guarantees an efficient emulation of $S^{(1)}$ by $S^{(2)}$.
That is, if all $V_I$ were connected to processors and $V_O$
to memories, the emulation would be such that the
processing work of one processor in $S^{(1)}$ would be exactly
equal to the processing work of one processor in the
image of $S^{(1)}$ in $S^{(2)}$. Also, the amount of data stored in
a single memory unit in $S^{(1)}$ would be exactly equal to
the amount of data stored in a memory unit in the image
of $S^{(1)}$ in $S^{(2)}$. In other words, the mapping is regular in
some sense. Analogously, the load balancing and
utilization in the image of $S^{(1)}$ in $S^{(2)}$ will be identical to
that in $S^{(1)}$.

Definition 5.9:

Let $\{S\}$ be a set of systems.

Define the relation R of type $\psi$(a,1:1) on $\{S\}$
denoted by R-$\psi$(a,1:1) as follows:

$(S^{(1)},S^{(2)}) \in$ R-$\psi$(a,1:1) iff $\exists$ a quasimorphism
$\psi = (\phi_I,\phi_O)$ type a, 1:1 from $S^{(1)}$ to $S^{(2)}$.

Theorem 5.10:

Let R-$\psi$(a,1:1) be as in Def. 5.9 then:

(1) R-$\psi$(a,1:1) is reflexive,

(2) R-$\psi$(a,1:1) is not symmetric,

(3) R-$\psi$(a,1:1) is transitive.


Proof:

For (1): To show reflexivity, need to show
$(S^{(1)},S^{(1)}) \in$ R-$\psi$(a,1:1) $\forall S^{(1)} \in \{S\}$. Let
$\psi = (\phi_I,\phi_O)$ be identity maps.

The rest is straightforward.

For (2): show R-$\psi$(a,1:1) is not symmetric.

Must show $(S^{(1)},S^{(2)}) \in$ R-$\psi$(a,1:1) does not imply
$(S^{(2)},S^{(1)}) \in$ R-$\psi$(a,1:1).

Outline: Constructing an example of $S^{(1)}$ and $S^{(2)}$
such that $(S^{(1)},S^{(2)}) \in$ R-$\psi$(a,1:1) and
$(S^{(2)},S^{(1)}) \notin$ R-$\psi$(a,1:1). Although an example
where  would  suffice  a  more 

interesting  example  is  given. 

Let 

SC)  =  (C<'),C/.'>) 

V/'»  =  {v„v,l,Vi>'>  =  {«„«,> 
c^'l  =  ((V»,u,),  (V,„U4)) 

C(‘»  =  (Ci*',  C}") 

=  {(Vk.u„)}. 

Cj"  =  t(v.,u.),(v..u,)). 

Let 

sP-i  =  (d*),  cp>) 

V/»>  =  (W..WI,},  Vi*»  =  {x.,x,} 

cP  = 

C(»)  = 

CP  =  {(Wk,x.)), 

CP  =  {(w„x,),(w„x,,)}. 

Then $\psi = (\phi_I,\phi_O)$ with $\phi_I(v_a) = w_a$, $\phi_I(v_b) = w_b$,
$\phi_O(u_c) = x_c$, $\phi_O(u_d) = x_d$ is quasimorphism type
a, 1:1 from $S^{(1)}$ to $S^{(2)}$, but there does not exist
quasimorphism type a, 1:1 from $S^{(2)}$ to $S^{(1)}$.
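For (3): show R-$\psi$(a,1:1) is transitive.

Must show: $(S^{(1)},S^{(2)}) \in$ R-$\psi$(a,1:1),
$(S^{(2)},S^{(3)}) \in$ R-$\psi$(a,1:1) $\to (S^{(1)},S^{(3)}) \in$ R-$\psi$(a,1:1).

Outline: Transitivity will be shown by exhibiting
$\psi$(a,1:1) from $S^{(1)}$ into $S^{(3)}$.

(1): Let $S^{(1)} = (C^{(1)},C_F^{(1)})$; $S^{(2)} = (C^{(2)},C_F^{(2)})$; and
$S^{(3)} = (C^{(3)},C_F^{(3)})$ be three systems.

(2): $(S^{(1)},S^{(2)}) \in$ R-$\psi$(a,1:1)
$\to \exists\, \phi_I^{(1)}: V_I^{(1)} \to V_I^{(2)}$, 1:1
and $\exists\, \phi_O^{(1)}: V_O^{(1)} \to V_O^{(2)}$, 1:1.

(3): (2) and Lemma 5.4
$\to \exists\, \phi_{I,O}^{(1)}: V_I^{(1)} \times V_O^{(1)} \to V_I^{(2)} \times V_O^{(2)}$, 1:1.
(2) and Lemma 5.5
$\to \exists\, \mu^{(1)}: P(V_I^{(1)} \times V_O^{(1)}) \to P(V_I^{(2)} \times V_O^{(2)})$, 1:1.

(4): $(S^{(2)},S^{(3)}) \in$ R-$\psi$(a,1:1)
$\to \exists\, \phi_I^{(2)}: V_I^{(2)} \to V_I^{(3)}$, 1:1
and $\exists\, \phi_O^{(2)}: V_O^{(2)} \to V_O^{(3)}$, 1:1.

(5): (4) and Lemma 5.4
$\to \exists\, \phi_{I,O}^{(2)}: V_I^{(2)} \times V_O^{(2)} \to V_I^{(3)} \times V_O^{(3)}$, 1:1.
(4) and Lemma 5.5
$\to \exists\, \mu^{(2)}: P(V_I^{(2)} \times V_O^{(2)}) \to P(V_I^{(3)} \times V_O^{(3)})$, 1:1.

(6): Define: $\phi_I = \phi_I^{(2)} \circ \phi_I^{(1)}: V_I^{(1)} \to V_I^{(3)}$. ("$\circ$" is
composition of maps)
Clearly: $\phi_I$ is map, 1:1.

(7): Define: $\phi_O = \phi_O^{(2)} \circ \phi_O^{(1)}: V_O^{(1)} \to V_O^{(3)}$.
Clearly: $\phi_O$ is map, 1:1.

(8): Define: $\phi_{I,O} = \phi_{I,O}^{(2)} \circ \phi_{I,O}^{(1)}$:
$\phi_{I,O}: V_I^{(1)} \times V_O^{(1)} \to V_I^{(3)} \times V_O^{(3)}$.
Clearly: $\phi_{I,O}$ is map, 1:1.

(9): Define: $\mu = \mu^{(2)} \circ \mu^{(1)}$:
$\mu: P(V_I^{(1)} \times V_O^{(1)}) \to P(V_I^{(3)} \times V_O^{(3)})$.
Clearly: $\mu$ is map, 1:1.

(10): Claim: $\psi = (\phi_I,\phi_O)$ is quasimorphism from
$S^{(1)}$ to $S^{(3)}$ type a, 1:1.

(11): Show: $\mu(C_F^{(1)}) \subseteq C_F^{(3)}$.
$(S^{(1)},S^{(2)}) \in$ R-$\psi$(a,1:1) $\to \mu^{(1)}(C_F^{(1)}) \subseteq C_F^{(2)}$.
$(S^{(2)},S^{(3)}) \in$ R-$\psi$(a,1:1) $\to \mu^{(2)}(C_F^{(2)}) \subseteq C_F^{(3)}$.
$\to \mu^{(2)}(\mu^{(1)}(C_F^{(1)})) \subseteq C_F^{(3)}$
$\to (\mu^{(2)} \circ \mu^{(1)})(C_F^{(1)}) = \mu(C_F^{(1)}) \subseteq C_F^{(3)}$.

(12): Show: $\forall C_m^{(1)} \in C^{(1)}\ \exists\, C_p^{(3)} \in C^{(3)}$
$\ni \mu(C_m^{(1)}) \subseteq C_p^{(3)} \cup C_F^{(3)}$.

(a): $(S^{(1)},S^{(2)}) \in$ R-$\psi$(a,1:1)
$\to \forall C_m^{(1)} \in C^{(1)}\ \exists\, C_n^{(2)} \in C^{(2)}$
$\ni \mu^{(1)}(C_m^{(1)}) \subseteq C_n^{(2)} \cup C_F^{(2)}$.

(b): $(S^{(2)},S^{(3)}) \in$ R-$\psi$(a,1:1)
$\to \forall C_n^{(2)} \in C^{(2)}\ \exists\, C_p^{(3)} \in C^{(3)}$
$\ni \mu^{(2)}(C_n^{(2)}) \subseteq C_p^{(3)} \cup C_F^{(3)}$.

(c): (a), (b) $\to \forall C_m^{(1)} \in C^{(1)}$
$\exists\, C_p^{(3)} \in C^{(3)} \ni \mu^{(2)}(\mu^{(1)}(C_m^{(1)}))$
$\subseteq \mu^{(2)}(C_n^{(2)} \cup C_F^{(2)})$
$= \mu^{(2)}(C_n^{(2)}) \cup \mu^{(2)}(C_F^{(2)})$
$\subseteq (C_p^{(3)} \cup C_F^{(3)}) \cup \mu^{(2)}(C_F^{(2)})$.

(d): (c) and $C_F^{(3)} \supseteq \mu^{(2)}(C_F^{(2)})$
$\to \mu^{(2)}(\mu^{(1)}(C_m^{(1)})) = (\mu^{(2)} \circ \mu^{(1)})(C_m^{(1)})$
$= \mu(C_m^{(1)}) \subseteq C_p^{(3)} \cup C_F^{(3)}$.

(13): (11) and (12) $\to \psi = (\phi_I,\phi_O)$ is quasimorphism
type a.

(14): (9) and Lemma 5.7 $\to \psi$ is 1:1.

(15): (13) and (14) $\to (S^{(1)},S^{(3)}) \in$ R-$\psi$(a,1:1).

(16): Conclusion: (15) $\to$ R-$\psi$(a,1:1) has the
property of transitivity.

□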
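
The core of the transitivity argument is that composing the vertex maps gives the same result as applying the two induced set maps one after the other. The sketch below checks this on a small example; it assumes maps are stored as dicts, and the function names and the vertex labels (including the third system's labels `p_*` and `q_*`) are hypothetical.

```python
# Illustrative check that composed vertex maps induce the composed mu map.

def compose(second, first):
    """Return the composition (second o first) of two dict-encoded maps."""
    return {x: second[y] for x, y in first.items()}

def mu(c, phi_i, phi_o):
    """Map every (input, output) pair of a correspondence."""
    return {(phi_i[a], phi_o[b]) for a, b in c}

def is_injective(phi):
    """A dict-encoded map is 1:1 when no two keys share a value."""
    return len(set(phi.values())) == len(phi)

# Maps from system 1 to system 2, and from system 2 to system 3.
PHI_I1 = {"v_a": "w_a", "v_b": "w_b"}
PHI_O1 = {"u_c": "x_c", "u_d": "x_d"}
PHI_I2 = {"w_a": "p_a", "w_b": "p_b"}
PHI_O2 = {"x_c": "q_c", "x_d": "q_d"}

PHI_I = compose(PHI_I2, PHI_I1)     # analogue of step (6) of the proof
PHI_O = compose(PHI_O2, PHI_O1)     # analogue of step (7) of the proof

STATE = {("v_a", "u_d"), ("v_b", "u_c")}
TWO_STEPS = mu(mu(STATE, PHI_I1, PHI_O1), PHI_I2, PHI_O2)
ONE_STEP = mu(STATE, PHI_I, PHI_O)  # analogue of step (9) of the proof
```

Both routes give the same set of pairs, and the composed maps remain injective because each factor is injective.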

Physical implications: If $\psi(S^{(1)}) \subseteq_a S^{(2)}$ then the system
$S^{(2)}$ can emulate system $S^{(1)}$. The movement of the
data is accomplished (a) by using the network ($C_n^{(2)}$)
correspondences, and (b) by using the feedback or
internal connection of the device connected to both input
and output of the network.

This type of emulation always exists if the $S^{(2)}$
system is partially or fully recirculating. If the network in
$S^{(2)}$ is partially or fully recirculating then
$\exists\, (w_x,w_y) \in C_F^{(2)}$. Then using maps $\phi_I(v_i) = w_x$
$\forall v_i \in V_I^{(1)}$, $\phi_O(v_j) = w_y$ $\forall v_j \in V_O^{(1)}$ will satisfy the
necessary conditions for quasimorphism type a. This
however will result in very poor load balancing. Great
improvement in load balancing optimality will result if
the quasimorphism is 1:1. Then each device in $\psi(S^{(1)})$
(the image of $S^{(1)}$ under $\psi$) will have the same amount of
computation (data) as the corresponding device in $S^{(1)}$.
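
The degenerate type a emulation described above can be sketched in a few lines: every input is sent to one fed-back input and every output to the matching fed-back output. The code is illustrative only; it assumes maps are dicts and correspondences are sets of pairs, the names `constant_maps` and `mu` and the example data are hypothetical, and it checks only the set inclusions of conditions (2) and (3) of Definition 4.6.

```python
# Illustrative constant-map emulation into a recirculating system.
from collections import Counter

def mu(c, phi_i, phi_o):
    """Map every (input, output) pair of a correspondence."""
    return {(phi_i[a], phi_o[b]) for a, b in c}

def constant_maps(v_in1, v_out1, cf2):
    """Send all vertices of system 1 onto one feedback pair of system 2."""
    if not cf2:
        raise ValueError("target system is nonrecirculating")
    w_x, w_y = min(cf2)             # any pair works; min() is deterministic
    return {v: w_x for v in v_in1}, {u: w_y for u in v_out1}

# Hypothetical source system and target feedback set.
STATES1 = [{("v0", "u0"), ("v1", "u1")}, {("v0", "u1")}]
CF1 = {("v0", "u0")}
CF2 = {("w0", "x0")}

PHI_I, PHI_O = constant_maps({"v0", "v1"}, {"u0", "u1"}, CF2)
IMAGES = [mu(c, PHI_I, PHI_O) for c in STATES1]

# Every image lies inside C_F of the target, so it also lies inside
# (C_n U C_F) for any state C_n of the target.
TYPE_A_OK = mu(CF1, PHI_I, PHI_O) <= CF2 and all(img <= CF2 for img in IMAGES)

# All source inputs land on one target vertex: poor load balancing.
LOAD = Counter(PHI_I.values())
```

Here `LOAD` shows both source inputs mapped to the single vertex `w0`, which is the load-balancing penalty the text attributes to a map that is not 1:1.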
Definition 5.11:

Let $\{S\}$ be a set of systems.

Define the relation R of type $\psi$(b,1:1) on $\{S\}$
denoted R-$\psi$(b,1:1) as follows:

$(S^{(1)},S^{(2)}) \in$ R-$\psi$(b,1:1) iff $\exists$ a quasimorphism
$\psi = (\phi_I,\phi_O)$ type b, 1:1 from $S^{(1)}$ to $S^{(2)}$.

Theorem 5.12:

Let R-$\psi$(b,1:1) be as in Def. 5.11 then:

(1) R-$\psi$(b,1:1) is reflexive,

(2) R-$\psi$(b,1:1) is not symmetric,

(3) R-$\psi$(b,1:1) is transitive.

Proof:

For (1): Reflexivity: similar to proof of Thm. 5.10.

For (2): Show: R-$\psi$(b,1:1) is not symmetric.

Must show $(S^{(1)},S^{(2)}) \in$ R-$\psi$(b,1:1) does not imply
$(S^{(2)},S^{(1)}) \in$ R-$\psi$(b,1:1).

Outline: Constructing an example of $S^{(1)}$ and $S^{(2)}$
such that $(S^{(1)},S^{(2)}) \in$ R-$\psi$(b,1:1) and
$(S^{(2)},S^{(1)}) \notin$ R-$\psi$(b,1:1).

'  l..et 

St')=(c('»,c/.'>) 

V/*>  =  (v,.Vb).Vy>  =  (u.,Uj} 

Cf"  =  {(v..u,)) 

=  {ci'».c|'>,ci") 

Ci'»  =  ((V..U,)) 

C}'>  =  {(Vb,u,)) 

Ci‘’  =  ((v.,u,)). 

Let 

S'*>  =  (C(*>,C^*>) 

VP  =  {'v„'Vb),  Vi*>  =  (x^Xj) 

CP  =  {(w.,x.)} 

c««  =  {cf,cp)) 

Ci*’  =  {(w„x,,),(w„x,)) 

Cj*>  =  {('VbAd)}- 

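Then $\psi = (\phi_I,\phi_O)$ with $\phi_I(v_a) = w_a$, $\phi_I(v_b) = w_b$,
and $\phi_O(u_c) = x_c$, $\phi_O(u_d) = x_d$ is quasimorphism
type b, 1:1 from $S^{(1)}$ to $S^{(2)}$, but there does not
exist quasimorphism type b, 1:1 from $S^{(2)}$ to $S^{(1)}$.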
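For (3): Show: R-$\psi$(b,1:1) is transitive.

Must show: $(S^{(1)},S^{(2)}) \in$ R-$\psi$(b,1:1),
$(S^{(2)},S^{(3)}) \in$ R-$\psi$(b,1:1)
$\to (S^{(1)},S^{(3)}) \in$ R-$\psi$(b,1:1).

Outline: Transitivity will be shown by exhibiting
$\psi$(b,1:1) from $S^{(1)}$ to $S^{(3)}$.

(1): Let $S^{(1)} = (C^{(1)},C_F^{(1)})$; $S^{(2)} = (C^{(2)},C_F^{(2)})$; and
$S^{(3)} = (C^{(3)},C_F^{(3)})$ be three systems.

The proof is similar to the one of Thm. 5.10 part
3, steps 2 through 11 except replace $\psi$(a,1:1) by
$\psi$(b,1:1).

(12): Show: $\forall C_m^{(1)} \in C^{(1)}\ \exists\, C_p^{(3)} \in C^{(3)}$
$\ni \mu(C_m^{(1)}) \subseteq C_p^{(3)}$.

(a): $(S^{(1)},S^{(2)}) \in$ R-$\psi$(b,1:1)
$\to \forall C_m^{(1)} \in C^{(1)}\ \exists\, C_n^{(2)} \in C^{(2)}$
$\ni \mu^{(1)}(C_m^{(1)}) \subseteq C_n^{(2)}$.

(b): $(S^{(2)},S^{(3)}) \in$ R-$\psi$(b,1:1)
$\to \forall C_n^{(2)} \in C^{(2)}\ \exists\, C_p^{(3)} \in C^{(3)}$
$\ni \mu^{(2)}(C_n^{(2)}) \subseteq C_p^{(3)}$.

(c): (a), (b)
$\to \forall C_m^{(1)} \in C^{(1)}\ \exists\, C_p^{(3)} \in C^{(3)}$
$\ni \mu^{(2)}(\mu^{(1)}(C_m^{(1)})) \subseteq C_p^{(3)}$.

(13): (11) and (12) $\to \psi = (\phi_I,\phi_O)$ is quasimorphism
type b.

(14): (9) and Lemma 5.7 $\to \psi$ is 1:1.

(15): (13) and (14) $\to (S^{(1)},S^{(3)}) \in$ R-$\psi$(b,1:1).

(16): Conclusion: (15) $\to$ R-$\psi$(b,1:1) has the
property of transitivity.

□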

Physical implication: If $\psi(S^{(1)}) \subseteq_b S^{(2)}$ then the system
$S^{(2)}$ can emulate system $S^{(1)}$. The movement of the
data is accomplished by using the network
correspondences ($C_n^{(2)}$). This type of emulation is harder to
achieve than the type a since the $C_F^{(2)}$ contribution
cannot be used to move the data. Again, as in type a, the
load balancing optimality will greatly increase if the
quasimorphism is 1:1. If the quasimorphism is 1:1 then
the load balancing as well as utilization in the image of
$S^{(1)}$ in $S^{(2)}$ will be identical to that in $S^{(1)}$.

The quasimorphism can be used to map multiple
copies of system $S^{(1)}$ into $S^{(2)}$, where
$\psi^{(1)}(S^{(1)}) \cap \psi^{(2)}(S^{(1)}) = \emptyset$ is a necessary additional
constraint. This will allow tandem crosschecking of partial
results of a computation and therefore can be used as
error detection for fault tolerance.


Definition 5.13:

Let $\{S\}$ be a set of systems.

Define the relation R of type $\psi$(c,1:1) on $\{S\}$
denoted R-$\psi$(c,1:1) as follows:

$(S^{(1)},S^{(2)}) \in$ R-$\psi$(c,1:1) iff $\exists$ a quasimorphism
$\psi = (\phi_I,\phi_O)$ type c, 1:1 from $S^{(1)}$ to $S^{(2)}$.

Theorem 5.14: Let R-$\psi$(c,1:1) be as in Def. 5.13 then:

(1) R-$\psi$(c,1:1) is reflexive,

(2) R-$\psi$(c,1:1) is not symmetric,

(3) R-$\psi$(c,1:1) is transitive.

Proof:

For (1): Reflexivity: similar to proof of Thm. 5.10.

For (2): Show R-$\psi$(c,1:1) is not symmetric.

Must show: $(S^{(1)},S^{(2)}) \in$ R-$\psi$(c,1:1) does not imply
$(S^{(2)},S^{(1)}) \in$ R-$\psi$(c,1:1).

Outline: Constructing an example of $S^{(1)}$ and $S^{(2)}$
such that $(S^{(1)},S^{(2)}) \in$ R-$\psi$(c,1:1) and
$(S^{(2)},S^{(1)}) \notin$ R-$\psi$(c,1:1).

Let 

S<'»=(C<'>.C^'>) 

V/"  =  (v.,v,}.Vi')  =  {u..u,} 

4'>={(v„u,)} 

d'l  =  (ci'i,c|'>) 

Ci'>  =  {(v,.Ud),(vt,u.)} 

C|'>  =  {(v.,u.)}. 

Let 

SW  =  (C(*).CP) 

V?»'  =  {w..w,}.Vi*>  =  {x..x„) 

=  {(Wb,xa)} 

C<*'  =  {Cfl.Cj»>,Cl»)) 

=  {(w»,x,)} 

CP  =  {(w„Xj),(w,„x,)} 

Ci*'  =  {(Wb,x,,)}. 

Then $\psi = (\phi_I,\phi_O)$ with $\phi_I(v_a) = w_a$, $\phi_I(v_b) = w_b$,
and $\phi_O(u_c) = x_c$, $\phi_O(u_d) = x_d$ is quasimorphism
type c, 1:1 from $S^{(1)}$ to $S^{(2)}$, but there does not
exist quasimorphism type c, 1:1 from $S^{(2)}$ to $S^{(1)}$.
For (3): Show R-$\psi$(c,1:1) is transitive.

Must show: $(S^{(1)},S^{(2)}) \in$ R-$\psi$(c,1:1),
$(S^{(2)},S^{(3)}) \in$ R-$\psi$(c,1:1) $\to (S^{(1)},S^{(3)}) \in$ R-$\psi$(c,1:1).

Outline: Transitivity will be shown by exhibiting
$\psi$(c,1:1) from $S^{(1)}$ into $S^{(3)}$.

(1): Let $S^{(1)} = (C^{(1)},C_F^{(1)})$; $S^{(2)} = (C^{(2)},C_F^{(2)})$; and
$S^{(3)} = (C^{(3)},C_F^{(3)})$ be three systems.

The proof is similar to the one of Thm. 5.10 part
3, steps 2 through 11 except replace $\psi$(a,1:1) by
$\psi$(c,1:1).

(12): Show: $\forall C_m^{(1)} \in C^{(1)}\ \exists\, C_p^{(3)} \in C^{(3)}$
$\ni \mu(C_m^{(1)}) = C_p^{(3)}$.

(a): $(S^{(1)},S^{(2)}) \in$ R-$\psi$(c,1:1)
$\to \forall C_m^{(1)} \in C^{(1)}\ \exists\, C_n^{(2)} \in C^{(2)}$
$\ni \mu^{(1)}(C_m^{(1)}) = C_n^{(2)}$.

(b): $(S^{(2)},S^{(3)}) \in$ R-$\psi$(c,1:1)
$\to \forall C_n^{(2)} \in C^{(2)}\ \exists\, C_p^{(3)} \in C^{(3)}$
$\ni \mu^{(2)}(C_n^{(2)}) = C_p^{(3)}$.

(c): (a), (b)
$\to \forall C_m^{(1)} \in C^{(1)}\ \exists\, C_p^{(3)} \in C^{(3)}$
$\ni \mu^{(2)}(\mu^{(1)}(C_m^{(1)})) = \mu^{(2)}(C_n^{(2)}) = C_p^{(3)}$.

(13): (11) and (12) $\to \psi = (\phi_I,\phi_O)$ is quasimorphism
type c.

(14): (9) and Lemma 5.7 $\to \psi$ is 1:1.

(15): (13) and (14) $\to (S^{(1)},S^{(3)}) \in$ R-$\psi$(c,1:1).

(16): Conclusion: (15) $\to$ R-$\psi$(c,1:1) has the
property of transitivity.

□

Physical implications of a quasimorphism type c: Since
it is required in type b that

$\forall C_m^{(1)} \in C^{(1)}\ \exists\, C_n^{(2)} \in C^{(2)} \ni C_m^{(1)} \subseteq C_n^{(2)}$,

there may be some side effects caused by emulating
the correspondence $C_m^{(1)}$. Moreover, these uncontrolled
side effects will not allow partitions to operate
independently. That is, connections that are part of $C_n^{(2)}$ but
not part of $C_m^{(1)}$ may be established when $C_n^{(2)}$ is used
to emulate $C_m^{(1)}$. This may or may not be a problem.
To analyze this potential problem, the type c was
defined. With a type c quasimorphism, when the
system $S^{(2)}$ emulates system $S^{(1)}$, the movement of the data
is accomplished by a subset of $C^{(2)}$. The difference
between type b and type c is that in type c,

$\forall C_m^{(1)} \in C^{(1)}\ \exists\, C_n^{(2)} \in C^{(2)} \ni C_m^{(1)} = C_n^{(2)}$.

This requirement will eliminate the side effects that
type b has. More importantly it means that $\psi(S^{(1)})$ is
actually an autonomous subsystem of $S^{(2)}$. The
autonomous property will be explored further in a later
paper studying partitionability.
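
The side effects that separate type b from type c can be computed as a set difference: the connections that the host state establishes beyond those being emulated. The sketch below is illustrative; it assumes correspondences are sets of (input, output) pairs already expressed in the host system's labels, and the function name and example data are hypothetical.

```python
# Illustrative computation of type b side effects.

def side_effects(emulated, host_state):
    """Connections set up by host_state that the emulated state never asked for."""
    if not emulated <= host_state:
        raise ValueError("host_state does not contain the emulated state")
    return host_state - emulated

EMULATED = {("w_a", "x_c")}

# Type b: the host state is a proper superset, so an extra link appears.
HOST_B = {("w_a", "x_c"), ("w_b", "x_d")}
EXTRA_B = side_effects(EMULATED, HOST_B)

# Type c: the host state equals the emulated state, so nothing extra appears.
HOST_C = {("w_a", "x_c")}
EXTRA_C = side_effects(EMULATED, HOST_C)
```

An empty result for every emulated state is what allows the image to behave as an autonomous subsystem in the type c case.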

VI. Conclusions

In this paper a theoretical basis for analyzing both
topologically regular and irregular interconnection
networks was developed. A rigid graph/algebraic model
that can be applied to both regular and irregular
interconnection networks was defined and its usefulness and
flexibility were demonstrated in subsequent analyses.
An important and very useful measure of similarity of
networks called quasimorphism was introduced. Three
types of quasimorphism were defined between two
systems $S^{(1)}$ and $S^{(2)}$, where type a is the least strict and
type c the most strict. Necessary conditions for each
type of quasimorphism are given and their properties
analyzed. The model and the quasimorphism relation
provide the necessary theoretical background for
studying the following problems of parallel processing.

(a) Emulation of system $S^{(1)}$ by system $S^{(2)}$.

(b) Fault tolerance method achieved by concurrent
execution of multiple copies of the
same problem.

(c) Partitioning of a system.

Future work includes characterizing the necessary
conditions for partitioning of a system and studying
multilevel quasimorphisms for analyzing systems involving
multiple networks.
Abstract

As hardware becomes less expensive, more and more
distributed algorithms will be implemented by special
purpose multiprocessor systems. An important
component of such a system is the processor interconnection
network. A general model of interconnections is used to
formally study composition, decomposition, and
partitionability properties of networks. For reasons of
implementation efficiency and reliability, these properties of
networks are salient factors. Three different types of
partitionability are distinguished and described and their
properties shown. An algorithm is presented and proven
correct that will accept as its input an arbitrary
interconnection network and will produce one of four possible
outputs: (1) the network is not partitionable; (2), (3),
and (4) the network is partitionable in one of the three
types of partitionability described.

I.  Introduction 

As hardware becomes less expensive, more and more
distributed algorithms will be embedded into special
purpose multiprocessor systems (e.g., [9, 10, 12]). Most
current research on interconnection networks is specific to
a single network or a single class of networks; it consists
of defining a model for the network or class to be
analyzed and using it for the analysis [16]. This method
suffers from the following drawback: the model usually
holds for only the network or class in question and
therefore the analytical results are useful only for that network
or class. A solution to this problem is to define a
completely general model, as was done in [11]. By using this
model, analytical results are applicable to most classes of
networks. The model was defined and used in [11] to
analyze the emulation properties of networks.
In this paper the properties of network composition,
decomposition, and partitionability are analyzed. An
algorithm is developed which will output one of the
following:

(1) The network is not partitionable.

(2) The network is partitionable into subnetworks with
common control signals and the combination of the
subnetworks will exactly generate all
interconnection patterns of the original network.

(3) The network is partitionable into subnetworks with
separate control signals and the combination of the
subnetworks will exactly generate all interconnection
patterns of the original network.

(4) The network is partitionable into subnetworks with
separate control signals and the combination of the
subnetworks will generate a superset of
interconnection patterns of the original network.

The partitionability property of interconnection networks
for parallel computer systems is important for the
following reasons.

(1) If the network is partitionable then the resource
allocation of only a subset of the total resources is
possible. This can be used as follows.

(a) A user can use only a small part of the machine
for program development and use the whole
machine when the program is developed.

(b) In a multiple user environment the partitioning
provides a natural protection among users.

(c) In a multitasking environment the partitioning
provides a protection among independent
tasks.

(2) If the network is partitionable the fault tolerance of
the system increases as follows.

(a) A method of graceful degradation is possible by
separating the faulty section from the correctly
operating ones.

(b) If in addition to being a partitionable network,
the sections are isomorphic, then an increase of
reliability may be realized by multiple mappings
of the same task onto the multiple sections
and tandem cross checking of partial
results.

(c) It is possible to construct a fault tolerant network
using a partitionable network as a core as
will be shown in future work.

(3) If the network is partitionable, then there is an
efficient implementation in terms of hardware and
control. The network can be implemented as a set
of network components each with its own set of
inputs and outputs.

(a) If an input/output belongs to the input/output
set of a component network then it does not
belong to a different component, and the routing
of the data paths on a VLSI chip or on a
printed circuit board will be simplified.

(b) In addition, only the subset of controls that
affect a particular partition will be connected
to it, therefore the routing of the control lines
may be simplified.

The results presented here are applicable to all network
topologies. It is assumed here that the reader is familiar
with basic graph theory [1, 6] and basic abstract algebra
[5, 7].

The paper is organized as follows. In section II the
basic concepts are defined. The definition of an interconnection
network with an arbitrary topology is given in
section III. In section IV three different types of partitionability
of interconnection networks are described. In
section V an algorithm is presented which determines if a
network is partitionable, and if it is, differentiates among
three types of partitionability.

II. Basic Definitions

In  this  section,  basic  definitions  needed  as  back¬ 
ground  for  the  rest  of  the  paper  are  given. 

Let the set of input labels of a graph/algebraic
structure be denoted by $V_I$ and the set of output labels
of the structure be denoted by $V_O$. All graph/algebraic
structures defined in this paper over $V_I \times V_O$ will assume
that $V_I \cap V_O = \emptyset$, $V_I \neq \emptyset$, $V_O \neq \emptyset$, where $\emptyset$ is the
empty set and $V_I \times V_O = \{\langle v_a, v_b\rangle \mid v_a \in V_I,\ v_b \in V_O\}$.

The following notation will be used throughout this
paper. The symbols are enclosed in a pair of double quotation
marks.

"{", "}" - delimiters for set.

"(", ")" - function application and grouping of operations.

"<", ">" - delimiters for n-tuple.

"[", "]" - defined in context.

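Definition 2.1: Let $C_m \subseteq [V_I \times V_O]$, then $C_m$ is an I/O
correspondence over $V_I \times V_O$.

Definition 2.2: Let $C_m \subseteq [V_I \times V_O]$ such that for any two
distinct edges $\langle v_a, v_b\rangle, \langle v_c, v_d\rangle \in C_m$, $v_b \neq v_d$; then
$C_m$ is a nondestructive correspondence. (Physically,
$C_m$ represents one state of a reconfigurable
network.)

Definition 2.3: Let $C[V_I \times V_O] \triangleq \{C_m \subseteq [V_I \times V_O] \mid C_m$
is nondestructive$\}$. Then $C[V_I \times V_O]$ is called the
C-set over $V_I \times V_O$.

Definition 2.4: Let $C_m \in C[V_I \times V_O]$, then
$s(C_m) \triangleq \{v_a \mid \langle v_a, v_b\rangle \in C_m\}$ is the source set of $C_m$.

Definition 2.5: Let $C_m \in C[V_I \times V_O]$, then
$d(C_m) \triangleq \{v_b \mid \langle v_a, v_b\rangle \in C_m\}$ is the destination
set of $C_m$.

Definition 2.6: Let
$C = \{C_m \mid m = 1, 2, \ldots, n\} \subseteq C[V_I \times V_O]$, then
$s(C) \triangleq \bigcup_m s(C_m)$ is the source set of $C$.

Definition 2.7: Let
$C = \{C_m \mid m = 1, 2, \ldots, n\} \subseteq C[V_I \times V_O]$, then
$d(C) \triangleq \bigcup_m d(C_m)$ is the destination set of $C$.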
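The definitions of this section can be made concrete with ordinary sets. The following is a minimal sketch, not part of the original paper: it assumes a correspondence is stored as a frozenset of (input, output) label pairs, and it assumes Definition 2.2 is read as "no output label is the destination of two different edges". All function names are illustrative.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Edge = Tuple[str, str]            # <v_a, v_b>: (input label, output label)
Correspondence = FrozenSet[Edge]  # one I/O correspondence C_m


def is_nondestructive(c: Correspondence) -> bool:
    """Assumed reading of Definition 2.2: every output label is the
    destination of at most one edge of the correspondence."""
    destinations = [dst for _, dst in c]
    return len(destinations) == len(set(destinations))


def source_set(c: Correspondence) -> Set[str]:
    """s(C_m) of Definition 2.4: input labels used by C_m."""
    return {src for src, _ in c}


def destination_set(c: Correspondence) -> Set[str]:
    """d(C_m) of Definition 2.5: output labels used by C_m."""
    return {dst for _, dst in c}


def source_set_of_states(states: Iterable[Correspondence]) -> Set[str]:
    """s(C) of Definition 2.6: union of the source sets of all C_m."""
    result: Set[str] = set()
    for c in states:
        result |= source_set(c)
    return result


def destination_set_of_states(states: Iterable[Correspondence]) -> Set[str]:
    """d(C) of Definition 2.7: union of the destination sets of all C_m."""
    result: Set[str] = set()
    for c in states:
        result |= destination_set(c)
    return result
```

With this encoding, a state in which two inputs drive the same output is rejected by `is_nondestructive`, while a state in which one input drives two outputs is accepted.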

III. Interconnection Network Model

In this section, a formal graph/algebraic model of an
interconnection network is presented. Graph models for
analyzing networks have been used by other researchers.
For example, in [3, 4, 8, 15] they are used to analyze regular
SW-banyan networks, and in [2] they are used to
study the partitioning of regular networks. The model
presented here differs from [3, 4, 8, 15] and [2] by being
completely general so that it can be used to describe an
arbitrary (including topologically irregular) interconnection
network.

Definition 3.1: Let $K = \langle C \rangle$ be such that:

(1) $C \subseteq C[V_I \times V_O]$.

(2) $V_I = s(C)$.

(3) $V_O = d(C)$.

(4) $|C| \geq 2$.

Then $K = \langle C \rangle$ is an I/O representation of a
reconfigurable network over $V_I \times V_O$.

An example of an arbitrary interconnection network and a
description of it using this notation is shown in Fig. 1.
Physical implications: $\langle v_a, v_b\rangle \in C_m$, $C_m \in C$ represents
the network moving data from input $v_a$ to output $v_b$
when the state of the network is $C_m$. $C$ represents the
set of all possible states of the reconfigurable network.
$C = \{C_0, C_1\}$

$C_0 = \{\langle A, D\rangle, \langle B, E\rangle\}$

$C_1 = \{\langle A, E\rangle, \langle C, D\rangle\}$

$V_I = \{A, B, C\}$, $V_O = \{D, E\}$

Fig. 1. Example of an arbitrary interconnection network.
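The four conditions of Definition 3.1 can be checked mechanically on the network of Fig. 1. The sketch below is illustrative only; it assumes states are stored as frozensets of (input, output) pairs and that nondestructive means no output label is driven by two edges of the same state. The function name `is_network` is hypothetical.

```python
def is_network(states, v_in, v_out):
    """Check conditions (1)-(4) of Definition 3.1 for K = <C>."""
    states = set(states)
    # (1) every state is a nondestructive correspondence
    #     (assumed reading: all edges of a state have distinct outputs)
    nondestructive = all(len({dst for _, dst in c}) == len(c) for c in states)
    # (2) V_I = s(C) and (3) V_O = d(C)
    sources = {src for c in states for src, _ in c}
    destinations = {dst for c in states for _, dst in c}
    # (4) |C| >= 2: the network must be reconfigurable
    return (nondestructive
            and sources == set(v_in)
            and destinations == set(v_out)
            and len(states) >= 2)


# The network of Fig. 1.
C0 = frozenset({("A", "D"), ("B", "E")})
C1 = frozenset({("A", "E"), ("C", "D")})
fig1_ok = is_network({C0, C1}, {"A", "B", "C"}, {"D", "E"})
```

A single-state set such as `{C0}` fails condition (4), which is why the constant part of a network is treated separately in Section V.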

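Definition 3.2: Let $K[V_I \times V_O] \triangleq \{K \mid K = \langle C \rangle$ is a
network over $V_I \times V_O\}$. Then $K[V_I \times V_O]$ is called
the K-set over $V_I \times V_O$.

Definition 3.3: Let $K^1 \in K[V_I^1 \times V_O^1]$, $K^1 = \langle C^1 \rangle$, and
$K^2 \in K[V_I^2 \times V_O^2]$, $K^2 = \langle C^2 \rangle$, be two networks
such that:

(1) $V_I^1 \subseteq V_I^2$, $V_O^1 \subseteq V_O^2$.

(2) $\forall C_i^1 \in C^1\ \exists C_j^2 \in C^2$ such that $C_i^1 \subseteq C_j^2$.

Then $K^1$ is a subnetwork of type b of $K^2$. Notation:
$K^1 \subseteq_b K^2$.

Definition 3.4: Let $K^1 \in K[V_I^1 \times V_O^1]$, $K^1 = \langle C^1 \rangle$, and
$K^2 \in K[V_I^2 \times V_O^2]$, $K^2 = \langle C^2 \rangle$, be two networks
such that:

(1) $V_I^1 \subseteq V_I^2$, $V_O^1 \subseteq V_O^2$.

(2) $\forall C_i^1 \in C^1\ \exists C_j^2 \in C^2$ such that $C_i^1 = C_j^2$.

Then $K^1$ is a subnetwork of type c of $K^2$. Notation:
$K^1 \subseteq_c K^2$.

Note: The reason for referring to these subnetworks as
types b and c is to make this notation consistent with the
definitions of subsystems in [11].

Definition 3.5: Let $K^1 \in K[V_I^1 \times V_O^1]$, $K^1 = \langle C^1 \rangle$, and
$K^2 \in K[V_I^2 \times V_O^2]$, $K^2 = \langle C^2 \rangle$, be two networks
such that:

(1) $V_I^1 = V_I^2$, $V_O^1 = V_O^2$.

(2) $C^1 = C^2$.

Then $K^1$ is equal to $K^2$. Notation: $K^1 = K^2$.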
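
Theorem 3.6: Let $K^1 \in K[V_I^1 \times V_O^1]$, $K^1 = \langle C^1 \rangle$, and
$K^2 \in K[V_I^2 \times V_O^2]$, $K^2 = \langle C^2 \rangle$, be two networks.
If $\forall C_i^1 \in C^1\ \exists C_j^2 \in C^2$ such that $C_i^1 \subseteq C_j^2$, then
$K^1 \subseteq_b K^2$.

Proof: (1): Show $V_I^1 \subseteq V_I^2$.

$(\forall C_i^1 \in C^1)(\exists C_j^2 \in C^2)(C_i^1 \subseteq C_j^2)$

$\rightarrow (\forall C_i^1 \in C^1)(C_i^1 \subseteq C^2_{j(i)})$

$\rightarrow (\forall C_i^1 \in C^1)(s(C_i^1) \subseteq s(C^2_{j(i)}))$

$\rightarrow (\bigcup_i s(C_i^1) \subseteq \bigcup_i s(C^2_{j(i)}))$

$\rightarrow (\bigcup_i s(C_i^1) \subseteq \bigcup_j s(C_j^2)) \rightarrow V_I^1 \subseteq V_I^2$.

(2): Show $V_O^1 \subseteq V_O^2$.

Similar to (1) except replace the s set by the d set.

□

Theorem 3.7: Let $K^1 \in K[V_I^1 \times V_O^1]$, $K^1 = \langle C^1 \rangle$, and
$K^2 \in K[V_I^2 \times V_O^2]$, $K^2 = \langle C^2 \rangle$, be two networks.
If $\forall C_i^1 \in C^1\ \exists C_j^2 \in C^2$ such that $C_i^1 = C_j^2$, then
$K^1 \subseteq_c K^2$.

Proof: Show $V_I^1 \subseteq V_I^2$ and $V_O^1 \subseteq V_O^2$.

The proofs are similar to the proof of Theorem 3.6. □

Theorem 3.8: Let $K^1 \in K[V_I^1 \times V_O^1]$, $K^1 = \langle C^1 \rangle$, and
$K^2 \in K[V_I^2 \times V_O^2]$, $K^2 = \langle C^2 \rangle$, be two networks.
If $C^1 = C^2$ then $K^1 = K^2$.

Proof: (1): Show $V_I^1 = V_I^2$.

$C^1 = C^2 \rightarrow s(C^1) = s(C^2) \rightarrow V_I^1 = V_I^2$.

(2): Show $V_O^1 = V_O^2$.

$C^1 = C^2 \rightarrow d(C^1) = d(C^2) \rightarrow V_O^1 = V_O^2$. □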
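Theorems 3.6 and 3.7 say that the two subnetwork relations can be decided from the state sets alone, since the label containments follow automatically. A minimal sketch under that reading, with states stored as frozensets of (input, output) pairs and hypothetical function names:

```python
def is_subnetwork_b(c1, c2):
    """Type b (Theorem 3.6): every state of C^1 is contained in
    some state of C^2."""
    return all(any(ci <= cj for cj in c2) for ci in c1)


def is_subnetwork_c(c1, c2):
    """Type c (Theorem 3.7): every state of C^1 is equal to
    some state of C^2."""
    return all(ci in c2 for ci in c1)


# C^2 is the state set of Fig. 1; C^1 keeps only the links out of input A.
big = {frozenset({("A", "D"), ("B", "E")}),
       frozenset({("A", "E"), ("C", "D")})}
small = {frozenset({("A", "D")}), frozenset({("A", "E")})}

type_b = is_subnetwork_b(small, big)   # each small state fits inside a big one
type_c = is_subnetwork_c(small, big)   # but no small state equals a big one
```

Type c implies type b, because equality of states is a special case of containment.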

IV. Composition and Decomposition of Networks

This section describes a "horizontal" composition
and decomposition of networks. The discussion here is
presented for the composition of two networks into one
and the decomposition of one network into two. However,
it can be generalized into the composition of n networks
into one and decomposition of one network into n,
$n \geq 3$. What is meant by the horizontal composition of
two networks $K^1$ and $K^2$ is that $V_I^1 \cap V_I^2 = \emptyset$ and
$V_O^1 \cap V_O^2 = \emptyset$. Similarly, the horizontal decomposition
of $K$ into two networks $K^1$ and $K^2$ will result in
$V_I^1 \cap V_I^2 = \emptyset$ and $V_O^1 \cap V_O^2 = \emptyset$. Two types of composition
(decomposition) are described. One, the $\sigma$-composition
(decomposition), corresponds to the physical
situation where the controls of the individual subnetworks
of the network are independent. The other type is
the $\tau$-composition (decomposition), which corresponds to
the physical situation where the controls of the individual
subnetworks of the network are dependent upon one
another.

This section conceptually consists of two parts. In
part one the definition of the $\sigma$-composition is given and
some of its basic properties are presented. In part two
the definition of the $\tau$-composition is given and its properties
are described.

Definition 4.1: Let $K^1 \in K[V_I^1 \times V_O^1]$, $K^1 = \langle C^1 \rangle$, and
$K^2 \in K[V_I^2 \times V_O^2]$, $K^2 = \langle C^2 \rangle$, be two networks
such that: $(V_I^1 \cup V_O^1) \cap (V_I^2 \cup V_O^2) = \emptyset$. Define the
$\sigma$-map as follows:

$K^1\ \sigma\ K^2 = \langle C^1 \rangle\ \sigma\ \langle C^2 \rangle \triangleq \langle \{C_p^1 \cup C_r^2 \mid C_p^1 \in C^1,\ C_r^2 \in C^2\} \rangle$.

This describes the composition of two networks where the
controls of the two networks are independent from one
another.
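As an illustration of Definition 4.1, the $\sigma$-map is a Cartesian product of the two state sets, with each pair of states merged by set union. The sketch below is not from the paper; the function name and the two small example networks are hypothetical.

```python
from itertools import product


def sigma_compose(c1, c2):
    """sigma-map of Definition 4.1: every state of C^1 combined with
    every state of C^2 (independent controls)."""
    labels1 = {v for c in c1 for edge in c for v in edge}
    labels2 = {v for c in c2 for edge in c for v in edge}
    if labels1 & labels2:
        raise ValueError("the two networks must use disjoint labels")
    return {a | b for a, b in product(c1, c2)}


# Two hypothetical two-state networks on disjoint label sets.
C1 = {frozenset({("A", "C")}), frozenset({("B", "C")})}
C2 = {frozenset({("X", "Z")}), frozenset({("Y", "Z")})}
C3 = sigma_compose(C1, C2)   # 2 * 2 = 4 composite states
```

Because set union is commutative, `sigma_compose(C1, C2)` and `sigma_compose(C2, C1)` give the same state set, which matches Lemma 4.2.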

Lemma 4.2: Let $K^1 \in K[V_I^1 \times V_O^1]$ and $K^2 \in K[V_I^2 \times V_O^2]$
be two networks such that:

$(V_I^1 \cup V_O^1) \cap (V_I^2 \cup V_O^2) = \emptyset$.

Then $K^1\ \sigma\ K^2 = K^2\ \sigma\ K^1$.

Proof: Obvious from the definition of the $\sigma$-map and the
commutativity property of set union.

□

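Theorem 4.3: Let $K^1 \in K[V_I^1 \times V_O^1]$, $K^1 = \langle C^1 \rangle$, and
$K^2 \in K[V_I^2 \times V_O^2]$, $K^2 = \langle C^2 \rangle$, be two networks
such that: $(V_I^1 \cup V_O^1) \cap (V_I^2 \cup V_O^2) = \emptyset$. Then
$K^1\ \sigma\ K^2 \in K[(V_I^1 \cup V_I^2) \times (V_O^1 \cup V_O^2)]$.

Proof: Let $C^3 = \{C_p^1 \cup C_r^2 \mid C_p^1 \in C^1,\ C_r^2 \in C^2\}$ and
let $C_t^3 \in C^3$.

(1): Show $C^3 \subseteq C[(V_I^1 \cup V_I^2) \times (V_O^1 \cup V_O^2)]$.

Clearly $C_t^3 \subseteq [(V_I^1 \cup V_I^2) \times (V_O^1 \cup V_O^2)]$.
Must show nondestructivity.

$\langle u_a, u_b\rangle, \langle u_c, u_d\rangle \in C_t^3 \rightarrow$ three cases.

(1.1): $\langle u_a, u_b\rangle, \langle u_c, u_d\rangle \in C_p^1$.
$C_p^1 \in C^1 \rightarrow u_b \neq u_d$.

(1.2): $\langle u_a, u_b\rangle, \langle u_c, u_d\rangle \in C_r^2$.
$C_r^2 \in C^2 \rightarrow u_b \neq u_d$.

(1.3): $\langle u_a, u_b\rangle \in C_p^1$, $C_p^1 \in C^1$, and
$\langle u_c, u_d\rangle \in C_r^2$, $C_r^2 \in C^2$.
$(V_I^1 \cup V_O^1) \cap (V_I^2 \cup V_O^2) = \emptyset \rightarrow V_O^1 \cap V_O^2 = \emptyset \rightarrow u_b \neq u_d$.

(1.4): (1.1), (1.2), and (1.3) $\rightarrow C_t^3 \in C[(V_I^1 \cup V_I^2) \times (V_O^1 \cup V_O^2)]$
$\rightarrow C^3 \subseteq C[(V_I^1 \cup V_I^2) \times (V_O^1 \cup V_O^2)]$.

(2): Show $s(C^3) = V_I^1 \cup V_I^2$.

$s(C^3) = s(\{C_p^1 \cup C_r^2 \mid C_p^1 \in C^1,\ C_r^2 \in C^2\})$
$= \bigcup \{s(C_p^1) \cup s(C_r^2) \mid C_p^1 \in C^1,\ C_r^2 \in C^2\}$
$= \bigcup \{s(C_p^1) \mid C_p^1 \in C^1\} \cup \bigcup \{s(C_r^2) \mid C_r^2 \in C^2\}$
$= s(C^1) \cup s(C^2) = V_I^1 \cup V_I^2$.

(3): Show $d(C^3) = V_O^1 \cup V_O^2$.

Similar to (2) except replace the s set by the
d set.

(4): Show $|C^3| \geq 2$.

(4.1): $|C^3| = |\{C_p^1 \cup C_r^2 \mid C_p^1 \in C^1,\ C_r^2 \in C^2\}|$.

(4.2): $C_p^1 \cup C_r^2 \neq C_s^1 \cup C_t^2$ whenever $p \neq s$ or $r \neq t$ $\rightarrow$ all
elements of $C^3$ are distinct.

(4.3): (4.1), (4.2) $\rightarrow |C^3| = |C^1| \cdot |C^2| \geq 2 \cdot 2 = 4$.

□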

Lemma 4.4: Let $K^1 \in K[V_I^1 \times V_O^1]$, $K^2 \in K[V_I^2 \times V_O^2]$,
and $K^3 \in K[V_I^3 \times V_O^3]$ be three networks such that
$(V_I^a \cup V_O^a) \cap (V_I^b \cup V_O^b) = \emptyset$, $a \neq b$, $a, b = 1, 2, 3$;
then $(K^1\ \sigma\ K^2)\ \sigma\ K^3 = K^1\ \sigma\ (K^2\ \sigma\ K^3)$.

Proof: Obvious from the definition of the $\sigma$-map and the
associativity property of set union.

□

Definition 4.5: Let $K \in K[V_I \times V_O]$ be a network. Let
$\{K^1, K^2, \ldots, K^n \mid K^i \in K[V_I^i \times V_O^i]\}$ be a set of networks
such that: $K = K^1\ \sigma\ K^2\ \sigma \cdots K^n$. Then

(1) $K^1\ \sigma\ K^2\ \sigma \cdots K^n$ is called a $\sigma$-decomposition
of $K$.

(2) $\{K^1, K^2, \ldots, K^n\}$ is called a $\sigma$-decomposition set
of $K$.

(3) $K^i$ is called a $\sigma$-decomposition element of $K$.

(4) $K$ is the $\sigma$-composition of $K^1\ \sigma\ K^2\ \sigma \cdots K^n$.

Definition 4.6: Let $K \in K[V_I \times V_O]$ be a network. If the
only possible $\sigma$-decomposition is $K = K^1$ then $K$ is
called a $\sigma$-prime network.

Definition 4.7: Let $K \in K[V_I \times V_O]$ be a network and let
$K = K^1$. Then $K^1$ is called the trivial
$\sigma$-decomposition of $K$.

Lemma 4.8: Let $K \in K[V_I \times V_O]$ be a network. Then $K$
has a $\sigma$-decomposition.

Proof: Let $K = K^1$ be the trivial $\sigma$-decomposition of $K$.

□

Definition 4.9: Let $K \in K[V_I \times V_O]$ be a network. Let
$K = K^1\ \sigma\ K^2\ \sigma \cdots K^n$ be a $\sigma$-composition, where
$\forall i$, $K^i$ is a $\sigma$-prime network. Then
$K^1\ \sigma\ K^2\ \sigma \cdots K^n$ is called a $\sigma$-composition prime
of $K$.

This decomposition can be used as a canonical form of a
network. Notice that this implies $V_I = \biguplus_{i=1}^{n} V_I^i$ and
$V_O = \biguplus_{i=1}^{n} V_O^i$ (where the notation $D = A \uplus B$ means
$D = A \cup B$ and $A \cap B = \emptyset$).

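Theorem 4.10: Let $K \in K[V_I \times V_O]$, $K = \langle C \rangle$, be a
network. Let $K = K^1\ \sigma\ K^2\ \sigma \cdots K^n$ be any
$\sigma$-decomposition. Then: $n \leq \log_2 |C|$.

Proof: Let $K^i = \langle C^i \rangle$.

(1): $K^i$ is a network $\rightarrow |C^i| \geq 2$.

(2): $|C| = \prod_{i=1}^{n} |C^i| \geq 2^n$.

(3): $n = \log_2 2^n \leq \log_2 |C|$. □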


This can be used as an upper bound on the number of networks
in a $\sigma$-decomposition set.

Theorem 4.11: Let $K \in K[V_I \times V_O]$, $K = \langle C \rangle$, be a
network.

(1) If $K$ has a nontrivial $\sigma$-decomposition then
$|C|$ is not a prime number.

(2) If $|C|$ is a prime number then $K$ does not
have a nontrivial $\sigma$-decomposition.

Proof: Follows from the proof of Theorem 4.10. □

This counting principle can be used as a necessary condition
on a $\sigma$-decomposition of a network.
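Theorems 4.10 and 4.11 give two cheap tests that depend only on the number of states $|C|$. The following sketch shows how they might be applied as a pre-check before attempting a decomposition; the function names are hypothetical and the code only encodes the counting arguments, not a decomposition procedure.

```python
from math import isqrt


def max_sigma_parts(num_states: int) -> int:
    """Upper bound of Theorem 4.10: n <= log2 |C|, so the largest
    possible n is floor(log2 |C|)."""
    if num_states < 2:
        raise ValueError("a network has at least two states")
    return num_states.bit_length() - 1


def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, isqrt(n) + 1))


def may_have_nontrivial_sigma_decomposition(num_states: int) -> bool:
    """Necessary (not sufficient) condition of Theorem 4.11:
    |C| must not be a prime number."""
    return not is_prime(num_states)
```

For example, a network with 7 states cannot have a nontrivial $\sigma$-decomposition, while one with 8 states may have one with at most 3 elements.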


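Theorem 4.12: Let $K^1 \in K[V_I^1 \times V_O^1]$, $K^1 = \langle C^1 \rangle$, and
$K^2 \in K[V_I^2 \times V_O^2]$, $K^2 = \langle C^2 \rangle$, be two networks
such that: $(V_I^1 \cup V_O^1) \cap (V_I^2 \cup V_O^2) = \emptyset$. Let
$K^3 = K^1\ \sigma\ K^2$ be a $\sigma$-composition.

(1) If $\emptyset_C \in C^2$ then $K^1 \subseteq_c K^3$, where $\emptyset_C$ is the
correspondence consisting of no edges, i.e., no
connections between the set of inputs and
the set of outputs.

(2) If $\emptyset_C \notin C^2$ then $K^1 \subseteq_b K^3$, but not
$K^1 \subseteq_c K^3$.

(3) If $\emptyset_C \in C^1$ then $K^2 \subseteq_c K^3$.

(4) If $\emptyset_C \notin C^1$ then $K^2 \subseteq_b K^3$, but not
$K^2 \subseteq_c K^3$.

Proof:

Case 1: Show $K^1 \subseteq_c K^3$.

(1): $K^3 = K^1\ \sigma\ K^2 \rightarrow \forall C_i^1 \in C^1,\ \forall C_j^2 \in C^2,\ \exists C_p^3 \in C^3$
such that $C_p^3 = C_i^1 \cup C_j^2$.

(2): (1) and $\emptyset_C \in C^2 \rightarrow \forall C_i^1 \in C^1\ \exists C_p^3 \in C^3$ such that
$C_i^1 = C_i^1 \cup \emptyset_C = C_p^3 \rightarrow K^1 \subseteq_c K^3$.

Case 2: Show $K^1 \subseteq_b K^3$ but not $K^1 \subseteq_c K^3$.

(1): Same as Case 1.

(2): (1) and $\emptyset_C \notin C^2 \rightarrow (\forall C_i^1 \in C^1\ \exists C_p^3 \in C^3$
such that $C_i^1 \subset C_p^3)$ and $(\forall C_i^1 \in C^1\ \nexists C_p^3 \in C^3$ such that
$C_i^1 = C_p^3) \rightarrow K^1 \subseteq_b K^3$ and not $K^1 \subseteq_c K^3$.

Cases 3 and 4: Same as Cases 1 and 2 by the commutativity
of the $\sigma$-composition (Lemma 4.2). □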

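Definition 4.13: Let $K^1 \in K[V_I^1 \times V_O^1]$, $K^1 = \langle C^1 \rangle$, and
$K^2 \in K[V_I^2 \times V_O^2]$, $K^2 = \langle C^2 \rangle$, be two networks
such that:

(1) $(V_I^1 \cup V_O^1) \cap (V_I^2 \cup V_O^2) = \emptyset$, and

(2) $|C^1| = |C^2|$.

Define the $\tau_\alpha$-map as follows:

(1) Define $\alpha: C^1 \rightarrow C^2$, a map that is 1:1 and onto.

(2) $K^1\ \tau_\alpha\ K^2 = \langle C^1 \rangle\ \tau_\alpha\ \langle C^2 \rangle \triangleq \langle \{C_p^1 \cup C_r^2 \mid \alpha(C_p^1) = C_r^2,\ C_p^1 \in C^1,\ C_r^2 \in C^2\} \rangle$.

This describes the composition of two networks where the
controls are dependent in the sense that choosing a $C_p^1$ in
$C^1$ means $\alpha(C_p^1)$ must be selected in $C^2$. Thus, the $\alpha$
map exactly specifies how the controls are dependent.
The basic difference between the $\sigma$-map and the $\tau_\alpha$-map is as
follows. Suppose $K^1 = \langle C^1 \rangle$ and $K^2 = \langle C^2 \rangle$.

If $K^3 = K^1\ \sigma\ K^2 = \langle C^3 \rangle$, then

(a) $|C^3| = |C^1| \cdot |C^2|$ and

(b) $C_p^1$ is a subset of $|C^2|$ correspondences in $C^3$.

If $K^3 = K^1\ \tau_\alpha\ K^2$ then

(a) $|C^3| = |C^1| = |C^2|$ and

(b) $C_p^1$ is a subset of one correspondence in $C^3$,
specifically $C_p^1 \cup \alpha(C_p^1)$.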
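The contrast between the two maps can be seen on a small example. The sketch below is illustrative only: it assumes the bijection $\alpha$ is supplied as a Python dict from the states of $C^1$ to the states of $C^2$, and the function names and example networks are hypothetical.

```python
from itertools import product


def sigma_compose(c1, c2):
    """Independent controls: every pairing of states is allowed."""
    return {a | b for a, b in product(c1, c2)}


def tau_compose(alpha):
    """Dependent controls (Definition 4.13): state C_p^1 may only be
    paired with alpha(C_p^1). `alpha` maps each state of C^1 to one
    state of C^2 and must be one-to-one."""
    if len(set(alpha.values())) != len(alpha):
        raise ValueError("alpha must be a one-to-one map")
    return {c1_state | c2_state for c1_state, c2_state in alpha.items()}


a0, a1 = frozenset({("A", "C")}), frozenset({("B", "C")})
b0, b1 = frozenset({("X", "Z")}), frozenset({("Y", "Z")})

C3_sigma = sigma_compose({a0, a1}, {b0, b1})   # |C^1| * |C^2| = 4 states
C3_tau = tau_compose({a0: b0, a1: b1})         # |C^1| = |C^2| = 2 states
```

Here every state of `C3_tau` also appears in `C3_sigma`, which is the situation described later for a $\tau$-partitionable network.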

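Definition 4.14: Let $K \in K[V_I \times V_O]$ be a network. Let
$\{K^1, K^2, \ldots, K^n \mid K^i \in K[V_I^i \times V_O^i]\}$ be a set of networks
such that: $K = K^1\ \tau_\alpha\ K^2\ \tau_\alpha \cdots K^n$. Then

(1) $K^1\ \tau_\alpha\ K^2\ \tau_\alpha \cdots K^n$ is called a $\tau$-decomposition
of $K$.

(2) $\{K^1, K^2, \ldots, K^n\}$ is called a $\tau$-decomposition set
of $K$.

(3) $K^i$ is called a $\tau$-decomposition element of $K$.

(4) $K$ is the $\tau$-composition of $K^1\ \tau_\alpha\ K^2\ \tau_\alpha \cdots K^n$.

Definition 4.15: Let $K \in K[V_I \times V_O]$, $K = \langle C \rangle$, be a
network. If there exist $K^1 \in K[V_I^1 \times V_O^1]$,
$K^1 = \langle C^1 \rangle$, and $K^2 \in K[V_I^2 \times V_O^2]$,
$K^2 = \langle C^2 \rangle$, two networks such that:

(1) $V_I^1 \cup V_I^2 = V_I$, and (2) $V_O^1 \cup V_O^2 = V_O$, then:

(1) If $K^1\ \tau_\alpha\ K^2 = K$, then $K$ is a $\tau$-partitionable
network.

(2) If $K^1\ \sigma\ K^2 = K$, then $K$ is a strictly
$\sigma$-partitionable network.

(3) If $K^1\ \sigma\ K^2 \neq K$ and $K^1\ \sigma\ K^2 \supseteq_c K$, then $K$
is a $\sigma$-partitionable network.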

Note that strictly $\sigma$-partitionable implies:
$|C| = |C^1| \cdot |C^2|$ and $C = \{C_p^1 \cup C_r^2 \mid C_p^1 \in C^1,\ C_r^2 \in C^2\}$. In
contrast $\sigma$-partitionable implies: $|C| < |C^1| \cdot |C^2|$ and
$C \subset \{C_p^1 \cup C_r^2 \mid C_p^1 \in C^1,\ C_r^2 \in C^2\}$. If $K$ is a
$\tau$-partitionable network then it is also $\sigma$-partitionable. It
is not strictly $\sigma$-partitionable because it is strictly
$\sigma$-partitionable only if $|C^1| \cdot |C^2| = |C|$ and it is
$\tau$-partitionable only if $|C^1| = |C^2| = |C|$, which together imply
$|C^1| = |C^2| = |C| = 1$; however, $|C^1|, |C^2|, |C| \geq 2$
by Definition 3.1. Also note that if there exists a $\sigma$-prime
composition of $K$, then $K$ is a strictly $\sigma$-partitionable
network.

V. Partitionability Algorithm

In this section an algorithm is presented that takes as
input any general network (with an arbitrary topological
structure) and produces one of four possible outputs.

(1) The network is not partitionable.

(2) The network is $\tau$-partitionable.

(3) The network is strictly $\sigma$-partitionable.

(4) The network is $\sigma$-partitionable.

The engineering interpretation of the four outputs is as
follows:

(1) The network is not partitionable into disjoint subnetworks.

(2) The network is partitionable into subnetworks with
common control signals that are dependent upon one
another and the combination of the subnetworks will
exactly generate all interconnection patterns of the
original network.

(3) The network is partitionable into subnetworks with
independent control signals and the combination of
the subnetworks will exactly generate all interconnection
patterns of the original network.

(4) The network is partitionable into subnetworks with
independent control signals and the combination of
the subnetworks will generate a superset of the interconnection
patterns of the original network.

The algorithm can be programmed on a computer,
and if the output of the algorithm is (2) or (3) then it will
produce a more efficient implementation of the network
in terms of data path hardware and possibly control
implementation. In case (4), even though a superset of
the states of the original network is obtained, the implementation
produced by the algorithm will be efficient in
most instances. The following definitions are needed to
discuss the algorithm and prove its correctness.

Definition 5.1: Let $K \in K[V_I \times V_O]$, $K = \langle C \rangle$. Let
$C_m \in C$ and $\langle v_a, v_b\rangle \in C_m$ be an edge (directed).
Denote the undirected arc associated with the
directed edge $\langle v_a, v_b\rangle$ by $\overline{\langle v_a, v_b\rangle}$. Let
$G[V_I \times V_O] \triangleq \{\overline{\langle v_a, v_b\rangle} \mid \langle v_a, v_b\rangle \in C_m,\ \forall C_m \in C\}$.
Then $G[V_I \times V_O]$ is the underlying
undirected graph of $K$.

Definition 5.2: Let $G[V_I \times V_O]$ be the underlying
undirected graph of $K \in K[V_I \times V_O]$. Then the
connected subgraphs of $G[V_I \times V_O]$ are called
components of $G[V_I \times V_O]$.
Notation: Components are denoted by $B^i$.

Denote the vertices associated with $B^i$ by $V_I^i$ and $V_O^i$,
$V_I^i \subseteq V_I$, $V_O^i \subseteq V_O$. In a component $B^i$ there exists a
path from each node to every other node and there is no
path between any two nodes from different components.
Clearly $G[V_I \times V_O] = \bigcup_i B^i$, $\bigcup_i V_I^i = V_I$, and $\bigcup_i V_O^i = V_O$.

Definition 5.3: Let $G[V_I \times V_O]$ be the underlying graph of
$K \in K[V_I \times V_O]$, $K = \langle C \rangle$. Let $C_m \in C$ and let
$B^i$ be a component of $G[V_I \times V_O]$. Define the projection
$p$ of $C_m$ onto $B^i$ as follows:
$p(C_m, B^i) \triangleq \{\langle v_a, v_b\rangle \in C_m \mid \overline{\langle v_a, v_b\rangle} \in B^i\}$.
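
Lemma 5.4: Let $G[V_I \times V_O]$ be the underlying graph of
$K \in K[V_I \times V_O]$, $K = \langle C \rangle$. Let $C_m \in C$ and let
$\{B^1, B^2, \ldots, B^n\}$ be the set of all components of
$G[V_I \times V_O]$. Then

$C_m = p(C_m, B^1) \uplus p(C_m, B^2) \uplus \cdots \uplus p(C_m, B^n)$.

Proof: (1): Show $p(C_m, B^i) \cap p(C_m, B^j) \neq \emptyset \rightarrow B^i = B^j$.

(1.1): $p(C_m, B^i) \cap p(C_m, B^j) \neq \emptyset \rightarrow \exists \langle v_a, v_b\rangle$ with
$\langle v_a, v_b\rangle \in p(C_m, B^i)$, $\langle v_a, v_b\rangle \in p(C_m, B^j)$.

(1.2): $\langle v_a, v_b\rangle \in p(C_m, B^i) \rightarrow \langle v_a, v_b\rangle \in C_m$, $\overline{\langle v_a, v_b\rangle} \in B^i$.

(1.3): $\langle v_a, v_b\rangle \in p(C_m, B^j) \rightarrow \langle v_a, v_b\rangle \in C_m$, $\overline{\langle v_a, v_b\rangle} \in B^j$.

(1.4): $\overline{\langle v_a, v_b\rangle} \in B^i$, $\overline{\langle v_a, v_b\rangle} \in B^j$, and
$G[V_I \times V_O] = \biguplus_r B^r \rightarrow B^i = B^j$.

(2): Show $C_m = \bigcup_i p(C_m, B^i)$.

(2.1): Show $C_m \subseteq \bigcup_i p(C_m, B^i)$.

$\langle v_a, v_b\rangle \in C_m \rightarrow \overline{\langle v_a, v_b\rangle} \in G[V_I \times V_O]$
$\rightarrow \exists B^i,\ \overline{\langle v_a, v_b\rangle} \in B^i \rightarrow \langle v_a, v_b\rangle \in p(C_m, B^i)$
$\rightarrow \langle v_a, v_b\rangle \in \bigcup_i p(C_m, B^i)$.

(2.2): Show $C_m \supseteq \bigcup_i p(C_m, B^i)$.

$\langle v_a, v_b\rangle \in \bigcup_i p(C_m, B^i) \rightarrow \exists B^i$,
$\langle v_a, v_b\rangle \in p(C_m, B^i) \rightarrow \langle v_a, v_b\rangle \in C_m$.

(3): (1) and (2) $\rightarrow C_m = \biguplus_i p(C_m, B^i)$. □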
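Definitions 5.1 to 5.3 and Lemma 5.4 can be exercised on a small network. The sketch below is an illustration, not the paper's procedure: components are found with a simple union-find over the labels (input and output labels are assumed disjoint, as required in Section II), and each component is stored as the set of edges it contains. The names `components` and `project` and the example network are hypothetical.

```python
def components(states):
    """Components B^i of the underlying undirected graph, each returned
    as a frozenset of the (input, output) edges it contains."""
    arcs = set().union(*states)
    parent = {}

    def find(v):
        # Follow parent links to the representative of v's group.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in arcs:
        parent.setdefault(src, src)
        parent.setdefault(dst, dst)
        parent[find(src)] = find(dst)   # the arc joins the two groups

    groups = {}
    for src, dst in arcs:
        groups.setdefault(find(src), set()).add((src, dst))
    return [frozenset(g) for g in groups.values()]


def project(state, component):
    """p(C_m, B^i): the edges of C_m that lie in component B^i."""
    return frozenset(state & component)


# Hypothetical network whose graph has two components.
states = {frozenset({("A", "C"), ("X", "Z")}),
          frozenset({("B", "C"), ("Y", "Z")})}
comps = components(states)
```

For every state, the projections onto the components are pairwise disjoint and their union gives the state back, which is the content of Lemma 5.4.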

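Definition 5.5: Let $G[V_I \times V_O]$ be the underlying
undirected graph of $K \in K[V_I \times V_O]$, $K = \langle C \rangle$.
Let $B^i$ be a component of $G[V_I \times V_O]$. Define the
residue set modulo $B^i$ as follows:

$r(B^i) \triangleq \{p(C_k, B^i) \mid \forall C_k \in C\}$.

Theorem 5.6: Let $B^i$ be a component of the underlying
graph $G[V_I \times V_O]$ of $K \in K[V_I \times V_O]$, $K = \langle C \rangle$.
Let $r(B^i)$ be the residue set modulo $B^i$, $B^i$ over
$V_I^i \times V_O^i$. If $|r(B^i)| \geq 2$ then $\langle r(B^i) \rangle \in K[V_I^i \times V_O^i]$.
$\langle r(B^i) \rangle$ is called a component
network of $K$, denoted by $K(B^i)$.

Proof: (1): Show $C_s \in r(B^i) \rightarrow C_s \in C[V_I^i \times V_O^i]$.

$C_s \in r(B^i) = \{p(C_k, B^i) \mid C_k \in C\} \rightarrow \exists C_k \in C,\ C_s = p(C_k, B^i)$
$\rightarrow C_s \in C[V_I^i \times V_O^i]$.

(2): Show $s(r(B^i)) = V_I^i$.

(2.1): Show $s(\{p(C_k, B^i) \mid C_k \in C\}) \subseteq V_I^i$.

$u_a \in s(\{p(C_k, B^i) \mid C_k \in C\}) \rightarrow \exists C_k \in C$,
$\langle u_a, u_b\rangle \in C_k$, $\overline{\langle u_a, u_b\rangle} \in B^i \rightarrow u_a \in V_I^i$.

(2.2): Show $s(\{p(C_k, B^i) \mid C_k \in C\}) \supseteq V_I^i$.

$u_a \in V_I^i \rightarrow \overline{\langle u_a, u_b\rangle} \in B^i \rightarrow \exists C_k \in C$,
$\langle u_a, u_b\rangle \in C_k \rightarrow \langle u_a, u_b\rangle \in p(C_k, B^i)$
$\rightarrow u_a \in s(\{p(C_k, B^i) \mid C_k \in C\}) = s(r(B^i))$.

(2.3): (2.1), (2.2) $\rightarrow s(r(B^i)) = V_I^i$.

(3): Show $d(r(B^i)) = V_O^i$.

Same as (2) except replace the s set by the
d set.

(4): Show $|r(B^i)| \geq 2$.

By Theorem hypothesis.

(5): (1), (2), (3) and (4) $\rightarrow \langle r(B^i) \rangle \in K[V_I^i \times V_O^i]$. □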

Given an arbitrary network it is possible that
$|r(B^i)| = 1$ for some $B^i$; that is, $p(C_m, B^i) = p(C_k, B^i)$,
$\forall C_m, C_k \in C$. Then $r(B^i)$ does not constitute a
reconfigurable network as defined. To handle this case
from an engineering point of view, do the following. If a
network contains such a $B^i$, that part of the network is
constant, that is, it has a single state only. So to remove
this constant part from the network $K = \langle C \rangle$ do the
following.

(1) Construct separately the constant part
$r(B^i)$, $\forall B^i$ such that $|r(B^i)| = 1$, as a set of nonreconfigurable
links.

(2) $K' \triangleq \langle \{C_m - \langle v_a, v_b\rangle \mid C_m \in C,\ \forall \overline{\langle v_a, v_b\rangle} \in B^i,\ \forall B^i$ such that $|r(B^i)| = 1\} \rangle$.

$K'$ then contains only the reconfigurable links. In the following
it is assumed that the constant part of the network
has been removed already.
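The two steps for removing the constant part can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: components are represented as sets of edges, a component is treated as constant when all states have the same projection onto it, and the names `components` and `split_constant_part` are hypothetical.

```python
def components(states):
    """Components of the underlying undirected graph, each as a
    frozenset of edges (labels of inputs and outputs are disjoint)."""
    arcs = set().union(*states)
    parent = {}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in arcs:
        parent.setdefault(src, src)
        parent.setdefault(dst, dst)
        parent[find(src)] = find(dst)
    groups = {}
    for src, dst in arcs:
        groups.setdefault(find(src), set()).add((src, dst))
    return [frozenset(g) for g in groups.values()]


def split_constant_part(states):
    """Return (constant_links, reconfigurable_states)."""
    constant = set()
    for b in components(states):
        residues = {frozenset(c & b) for c in states}   # r(B^i)
        if len(residues) == 1:                          # |r(B^i)| = 1
            constant |= b                               # step (1)
    reconfigurable = {frozenset(c - constant) for c in states}  # step (2)
    return constant, reconfigurable


# Hypothetical network: link P -> Q is present in every state.
states = {frozenset({("A", "C"), ("P", "Q")}),
          frozenset({("B", "C"), ("P", "Q")})}
constant, reconfigurable = split_constant_part(states)
```

In this example the link from P to Q would be built as fixed wiring, and only the two states that switch between A and B remain for the partitionability analysis.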

If $G[V_I \times V_O] = B^1$, then $K = \langle r(B^1) \rangle$. In this case,
$K$ is a $\sigma$-prime network and is not partitionable.

The following Lemmas and Theorems are shown for
the case of $G[V_I \times V_O]$ having two components, $B^1$ and
$B^2$, for reasons of simplicity. They are all applicable to
the case of $B^1, B^2, \ldots, B^n$, $n > 2$.

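Lemma 5.7: Let $\{B^1, B^2\}$ be the set of components of the
underlying graph $G[V_I \times V_O]$ of $K \in K[V_I \times V_O]$,
$K = \langle C \rangle$. Let $|r(B^i)| = |C|$, $\forall i$. Then $\exists\ \tau_\alpha$
such that if
$\langle C^3 \rangle = K(B^1)\ \tau_\alpha\ K(B^2)$ then $C \subseteq C^3$.

Proof: (1): $|r(B^i)| = |C|$, $\forall i$.

This is a necessary and sufficient condition for
the existence of $\alpha$. It implies
$p(C_m, B^i) \neq p(C_k, B^i)$, $\forall C_m, C_k \in C$, $m \neq k$, $\forall i$.

(2): $\langle C^3 \rangle = K(B^1)\ \tau_\alpha\ K(B^2) \rightarrow$
$C^3 = \{p(C_m, B^1) \cup p(C_k, B^2) \mid \alpha(p(C_m, B^1)) = p(C_k, B^2),\ C_m \in C,\ C_k \in C\}$.

(3): Let
$\alpha: \{p(C_m, B^1) \mid C_m \in C\} \rightarrow \{p(C_k, B^2) \mid C_k \in C\}$,
$\alpha(p(C_m, B^1)) = p(C_m, B^2)$.

(4): $C_m \in C \rightarrow C_m = p(C_m, B^1) \cup p(C_m, B^2)$.

(5): (2), (3) and (4) $\rightarrow C_m \in C^3 \rightarrow C \subseteq C^3$. □

Lemma 5.8: Let $\{B^1, B^2\}$ be the set of components of the
underlying graph $G[V_I \times V_O]$ of $K \in K[V_I \times V_O]$,
$K = \langle C \rangle$. Let $|r(B^i)| = |C|$, $\forall i$. Then $\exists\ \tau_\alpha$
such that if
$\langle C^3 \rangle = K(B^1)\ \tau_\alpha\ K(B^2)$ then $C^3 \subseteq C$.

Proof: (1): (1), (2), and (3) from the proof of Lemma 5.7.

(2): $C_t \in C^3 \rightarrow C_t = p(C_m, B^1) \cup p(C_m, B^2)$.

(3): (1) and (2) $\rightarrow C_t \in C \rightarrow C^3 \subseteq C$. □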

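Theorem 5.9: Let $\{B^1, B^2\}$ be the set of components of
the underlying graph $G[V_I \times V_O]$ of
$K \in K[V_I \times V_O]$, $K = \langle C \rangle$. Let $|r(B^i)| = |C|$,
$\forall i$. Then $\exists\ \tau_\alpha$ such that $K(B^1)\ \tau_\alpha\ K(B^2) = K$.

Proof: (1): Let
$\alpha: \{p(C_m, B^1) \mid C_m \in C\} \rightarrow \{p(C_m, B^2) \mid C_m \in C\}$,
$\alpha(p(C_m, B^1)) = p(C_m, B^2)$.

Let $K(B^1)\ \tau_\alpha\ K(B^2) = \langle C^3 \rangle$.

(2): Lemma 5.7 $\rightarrow C \subseteq C^3$.

(3): Lemma 5.8 $\rightarrow C^3 \subseteq C$.

(4): (2) and (3) $\rightarrow C^3 = C$.

(5): Theorem 3.8 and $C^3 = C \rightarrow K(B^1)\ \tau_\alpha\ K(B^2) = K$. □

Lemma 5.10: Let $\{B^1, B^2\}$ be the set of components of
the underlying graph $G[V_I \times V_O]$ of
$K \in K[V_I \times V_O]$, $K = \langle C \rangle$. Let
$K(B^1)\ \sigma\ K(B^2) = \langle C^3 \rangle$. Then $C \subseteq C^3$.

Proof: (1): $C_m \in C \rightarrow C_m = p(C_m, B^1) \cup p(C_m, B^2)$.

(2): $\langle C^3 \rangle = K(B^1)\ \sigma\ K(B^2) = \langle \{p(C_m, B^1) \mid C_m \in C\} \rangle\ \sigma\ \langle \{p(C_k, B^2) \mid C_k \in C\} \rangle$
$\rightarrow C_m \in C^3 \rightarrow C \subseteq C^3$. □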

Theorem 5.11: Let $\{B^1, B^2\}$ be the set of components of
the underlying graph $G[V_I \times V_O]$ of
$K \in K[V_I \times V_O]$, $K = \langle C \rangle$. Let $K(B^1)\ \sigma\ K(B^2) = \langle C^3 \rangle$.
Let $|r(B^1)| \cdot |r(B^2)| = |C|$. Then
$K(B^1)\ \sigma\ K(B^2) = K$.

Proof: (1): By Lemma 5.10, $C \subseteq C^3$.
By Theorem hypothesis $|r(B^1)| \cdot |r(B^2)| = |C^3| = |C| \rightarrow C = C^3$.

(2): By Theorem 3.8 and (1) $\rightarrow K(B^1)\ \sigma\ K(B^2) = K$. □

Theorem 5.12: Let $\{B^1, B^2\}$ be the set of components of
the underlying graph $G[V_I \times V_O]$ of
$K \in K[V_I \times V_O]$, $K = \langle C \rangle$. Let $K(B^1)\ \sigma\ K(B^2) = \langle C^3 \rangle$.
Let $|r(B^1)| \cdot |r(B^2)| > |C|$. Then $K(B^1)\ \sigma\ K(B^2) \supseteq_c K$
and $K(B^1)\ \sigma\ K(B^2) \neq K$.

Proof: (1): By Lemma 5.10, $C \subseteq C^3$.
By Theorem hypothesis $|r(B^1)| \cdot |r(B^2)| = |C^3| > |C| \rightarrow C \subset C^3$.

(2): Theorem 3.7 and (1) $\rightarrow K(B^1)\ \sigma\ K(B^2) \supseteq_c K$,
and Definition 3.5 and (1) $\rightarrow K(B^1)\ \sigma\ K(B^2) \neq K$.

□

Note that by definition $K(B^i)$, $\forall i$, is a $\sigma$-prime network.

Definition 5.13: If $B^1, B^2, \ldots, B^n$ are the components of
$G[V_I \times V_O]$, where $G[V_I \times V_O]$ is the underlying
graph of $K$, then $K(B^1), K(B^2), \ldots, K(B^n)$ is a prime
decomposition of $K$.

The prime decomposition of $K$ is unique and can be used
as a canonical form of the network.

The algorithm is presented below. The input is an
arbitrary network $K \in K[V_I \times V_O]$, $K = \langle C \rangle$, with the
constant part removed. The output is one of (1) $K$ is not
partitionable, (2) $K$ is $\tau$-partitionable, (3) $K$ is strictly
$\sigma$-partitionable, (4) $K$ is $\sigma$-partitionable. In cases (2), (3),
and (4) the algorithm also produces the component networks
$K(B^1), K(B^2), \ldots, K(B^n)$, in step (6).

Algorithm

Input: $K \in K[V_I \times V_O]$, $K = \langle C \rangle$.

Output:

(1): $K$ is not partitionable, or

(2): $K$ is $\tau$-partitionable, or

(3): $K$ is strictly $\sigma$-partitionable, or

(4): $K$ is $\sigma$-partitionable.

(1) Construct the underlying graph $G[V_I \times V_O]$ of $K$.

(2) Find the components $B^1, B^2, \ldots, B^n$ of $G[V_I \times V_O]$.

(3) If ($n = 1$) return (1).

(4) Find $p(C_m, B^i)$, $\forall C_m \in C$, $i = 1, 2, \ldots, n$.

(5) Find $r(B^i) = \{p(C_m, B^i) \mid \forall C_m \in C\}$, $i = 1, 2, \ldots, n$.

(6) Construct $K(B^i) = \langle r(B^i) \rangle$, $i = 1, 2, \ldots, n$.

(7) If ($|r(B^i)| = |C|$, $i = 1, 2, \ldots, n$)
then return (2).

(8) If ($\prod_i |r(B^i)| = |C|$)
then return (3).

(9) Else return (4).

Proof of correctness: The proof is directly implied by
Theorems 5.9, 5.11, and 5.12. □
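The nine steps of the algorithm translate fairly directly into code. The following is a minimal sketch under stated assumptions, not the authors' implementation: states are frozensets of (input, output) pairs, the constant part is assumed to have been removed already, components are stored as sets of edges, and the names `components` and `classify` are hypothetical. The return value is the output number (1) to (4) together with the residue sets $r(B^i)$.

```python
from math import prod


def components(states):
    """Steps (1)-(2): components of the underlying undirected graph,
    each returned as a frozenset of edges."""
    arcs = set().union(*states)
    parent = {}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in arcs:
        parent.setdefault(src, src)
        parent.setdefault(dst, dst)
        parent[find(src)] = find(dst)
    groups = {}
    for src, dst in arcs:
        groups.setdefault(find(src), set()).add((src, dst))
    return [frozenset(g) for g in groups.values()]


def classify(states):
    """Return (output_number, residue_sets) for K = <C>."""
    states = set(states)
    comps = components(states)
    if len(comps) == 1:                                   # step (3)
        return 1, []
    # steps (4)-(6): projections, residue sets, component networks
    residues = [{frozenset(c & b) for c in states} for b in comps]
    if all(len(r) == len(states) for r in residues):      # step (7)
        return 2, residues
    if prod(len(r) for r in residues) == len(states):     # step (8)
        return 3, residues
    return 4, residues                                    # step (9)


a0, a1 = frozenset({("A", "C")}), frozenset({("B", "C")})
b0, b1 = frozenset({("X", "Z")}), frozenset({("Y", "Z")})

tau_case = classify({a0 | b0, a1 | b1})
strict_case = classify({a0 | b0, a0 | b1, a1 | b0, a1 | b1})
sigma_case = classify({a0 | b0, a0 | b1, a1 | b0})
prime_case = classify({frozenset({("A", "D"), ("B", "E")}),
                       frozenset({("A", "E"), ("C", "D")})})   # Fig. 1
```

The four example calls exercise each branch: dependent controls, all four combinations, three of the four combinations, and the single-component network of Fig. 1.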

The outputs of the algorithm can be used in the following
ways. If the output is "1" (not partitionable),
then the system designer will know that the network cannot
be divided into individual subnetworks. If the output
is "3" (strictly $\sigma$-partitionable), then the network can be
partitioned and the composition of the component networks
will produce a set of correspondences identical to
that of the original network. Note that if a network is
strictly $\sigma$-partitionable it is not $\tau$-partitionable nor
$\sigma$-partitionable. If the output is "4" ($\sigma$-partitionable), then
the network can be partitioned and the composition of
the component networks will produce a set of correspondences
that is a superset of that of the original network.
If the output is "2", the network is $\tau$-partitionable. Any
network that is $\tau$-partitionable is also $\sigma$-partitionable.
However, if a network is $\tau$-partitionable then
$|r(B^i)| = |r(B^j)| = |C|$, $1 \leq i, j \leq n$, which is not true in general
for a $\sigma$-partitionable network. Since
$|r(B^i)| = |r(B^j)| = |C|$, $1 \leq i, j \leq n$, the number of
correspondences in each component network $\langle r(B^i) \rangle$ is
the same ($|C|$) for all $i$, $1 \leq i \leq n$. This property means
that the same control decoders can be used in all network
components in a $\tau$-partitionable network.

The  output  of  the  algorithm  applies  only  to  the 
reconSgurable  part  of  the  network  because  partitionabil- 
ity  is  defined  in  terms  of  a  decomposition  into 
"reconSgurable'  network  components  (|r(B‘)|  >1).  If 
the  original  network  bad  some  B'  such  that  |  r(B')|  =  1, 
then  those  constant  componeot(s)  should  be  added  to  the 
network  component) s)  generated  by  the  algorithm  in 
order  to  reproduce  the  original  network. 

There  are  less  strict  definitions  of  partitionability 
than  the  one  used  here.  Future  work  in  this  area 
includes  the  study  of  the  partitionability  of  networks  if 
some  of  the  network  correspondences  are  not  used,  e.g., 
as can be done with the cube network [13, 14].

VI. Conclusion

In  this  paper  the  interconnection  network  properties 
of  composition,  decomposition,  and  partitionability  were 
analyzed.  The  partitionability  property  of  interconnec¬ 
tion  networks  for  parallel  computer  systems  is  important 
for  (1)  resource  allocation,  (2)  fault  tolerance,  and  (3) 
efficient hardware implementation as discussed in the
introduction. The results presented here are valid across
all  network  topologies. 

In  summary,  a  general  model  of  interconnection  net¬ 
works  was  used  to  describe  composition,  decomposition, 
and  partitionability  properties  of  networks.  An  algorithm 
for  network  partitioning  was  presented  and  proven 
correct.

The interconnection of a large number of processors and other devices to form a
parallel/distributed computing system is a research area receiving a great deal of
attention. One method is to use a multistage network. This paper compares two
classes of multistage networks by examining two representative networks: the
Generalized Cube and the Augmented Data Manipulator. The two topologies are
compared using a graph model. By interpreting the graphical representations of the
networks in different ways, different but functionally equivalent implementations
result. The costs of the various implementations are compared taking VLSI
considerations into account. Finally, the robustness (fault tolerance) of the different
networks is measured and contrasted. © 1985 Academic Press, Inc.

The interconnection of a large number of processors and other devices to
form a parallel/distributed computing system is a research area receiving a
great deal of attention. Many different approaches to the interconnection
method have been proposed and discussed, including the use of buses [47],
hierarchies of buses [44], direct links [13], single-stage networks [21],
multistage networks [9, 22, 30, 38], and crossbars [49]. An important aspect of
this research is the evaluation and comparison of the proposed approaches
[6, 16, 40, 45]. The conclusion most often reached is that the best scheme to
use in a particular design depends highly upon the intended application,
performance requirements, and cost constraints. Once a connection method
is chosen (e.g., single-stage network), a specific design must be decided upon
and then implemented. During this phase of a system's specification, it is
important for the designer to understand fully the differences and similarities
between candidate designs.

This work is motivated by an ongoing study of methods to model distributed
systems and an examination of networks suitable for use in the PASM
[41] and PUMPS [10] systems. Two classes of multistage networks that have
been considered for use in these and other systems, cube type and data
manipulator type, are investigated in this paper. Specifically, graph models
are used to quantify the difference between the Generalized Cube and Augmented
Data Manipulator (ADM) networks in terms of cost and robustness
(fault tolerance). Graph models are used because they are unencumbered by
implementation details and are an excellent tool for representing an essential
characteristic of a network: its topology. They also facilitate comparison of
this work with other studies (e.g., [5, 19]).

The Generalized Cube and ADM networks are defined in Section II. Their
relation to other multistage networks described in the literature is also
discussed. Using a graphical representation, the networks' topologies are
compared in Section III. In Section IV, two functionally equivalent
implementations resulting from two different graph interpretations are examined to
compare the cost of each network. Here, using VLSI chips is considered and
costs are compared relative to the fraction of a stage that can be implemented
on one chip. Finally, Section V contains an analysis of the robustness each
network exhibits.


II. The Generalized Cube and ADM Networks

The Generalized Cube network is a multistage cube-type network topology
that was introduced as a standard for comparing network topologies [39].
Assume the network has N inputs and N outputs; in Fig. 1, N = 8. The
Generalized Cube topology has $n = \log_2 N$ stages, where each stage consists
of a set of N lines connected to N/2 interchange boxes. Each interchange box
is a two-input, two-output device. The labels of the input/output lines entering
the upper and lower inputs of an interchange box serve as the labels for the
upper and lower outputs, respectively. Each interchange box can be set to one
of the four legitimate states shown [22].

The connections in this network are based on the cube interconnection
functions [35]. Let $P = p_{n-1} \cdots p_1 p_0$ be the binary representation of an
arbitrary I/O line label. Then the n cube interconnection functions can be
defined as

$$\mathrm{cube}_i(p_{n-1} \cdots p_1 p_0) = p_{n-1} \cdots p_{i+1}\,\bar{p}_i\,p_{i-1} \cdots p_1 p_0,$$

where $0 \le i < n$, $0 \le P < N$, and $\bar{p}_i$ denotes the complement of $p_i$. This
means that the cube_i interconnection function connects P to cube_i(P), where
cube_i(P) is the I/O line whose label differs from P in just the ith bit position.
Stage i of the Generalized Cube topology contains the cube_i interconnection
function. That is, it pairs I/O lines that differ in the ith bit position.

Fig. 1. Generalized Cube network for N = 8 [37]. The four legitimate states of an
interchange box are shown: straight, exchange, lower broadcast, and upper broadcast.
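
The definition of cube_i amounts to flipping bit i of the line label. The following minimal Python sketch illustrates this; the names `cube` and `stage_pairs` are hypothetical helpers introduced here and do not appear in the original paper.

```python
def cube(i: int, p: int) -> int:
    """Return cube_i(p): the line label that differs from p only in bit i."""
    return p ^ (1 << i)


def stage_pairs(i: int, n_lines: int) -> list[tuple[int, int]]:
    """List the pairs of line labels that share an interchange box in stage i.

    Each pair is reported once, starting from the label whose bit i is 0.
    """
    return [(p, cube(i, p)) for p in range(n_lines) if not p & (1 << i)]


if __name__ == "__main__":
    N = 8                      # network size used in Fig. 1
    n = N.bit_length() - 1     # n = log2(N) for N a power of two
    for i in reversed(range(n)):
        print(f"stage {i}: {stage_pairs(i, N)}")
```

For N = 8, stage 2 pairs line 2 with line 6, and each stage yields N/2 = 4 pairs, one per interchange box.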

The ADM network is shown in Fig. 2 for N = 8. It is based on Feng's data
manipulator [15]. In this network, a stage consists of N switching elements
or nodes and the 3N data paths that are connected to the inputs of a succeeding
stage. Each node can connect one of its inputs to one or more of its outputs.
At stage i of the ADM network, $0 \le i < n$, the first output of node j is
connected to the input of node $(j - 2^i) \bmod N$ of the next stage; the second
output is connected to the input of node j; and the third output is connected
to the input of node $(j + 2^i) \bmod N$. Because $(j - 2^{n-1})$ equals $(j + 2^{n-1})$
mod N, there are actually only two distinct data paths instead of three from
each node in stage n - 1 (in the figure, stage 2). There is an additional set
of N nodes at the output stage.
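
The stage wiring described above can be checked with a short Python sketch. The function name `adm_successors` is an assumption made for illustration; it simply evaluates the three connection rules and removes duplicates.

```python
def adm_successors(stage: int, j: int, n_nodes: int) -> list[int]:
    """Distinct next-stage nodes reachable from node j at the given ADM stage.

    The three outputs go to (j - 2^stage) mod N, j, and (j + 2^stage) mod N.
    """
    step = 1 << stage
    targets = {(j - step) % n_nodes, j, (j + step) % n_nodes}
    return sorted(targets)


if __name__ == "__main__":
    N = 8
    for stage in (2, 1, 0):
        print(stage, [adm_successors(stage, j, N) for j in range(N)])
```

With N = 8, every node in stages 0 and 1 reaches three distinct nodes, while every node in stage 2 reaches only two, in agreement with the remark about stage n - 1.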

Both of these networks are based on the PM2I interconnection functions
[35]. There are 2n of these functions defined by $\mathrm{PM2}_{+i}(j) = (j + 2^i) \bmod N$
and $\mathrm{PM2}_{-i}(j) = (j - 2^i) \bmod N$ for $0 \le j < N$, $0 \le i < n$, where
$-x \bmod N = N - (x \bmod N)$. (Note $\mathrm{PM2}_{+(n-1)} = \mathrm{PM2}_{-(n-1)}$.)

Fig. 2. Augmented Data Manipulator network for N = 8 [37]. (Lowercase letters represent
end-around connections.)
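
As a small check of the note that PM2_{+(n-1)} and PM2_{-(n-1)} coincide, the sketch below tabulates all 2n PM2I functions and counts how many are distinct as mappings. The helper names `pm2` and `distinct_pm2i` are hypothetical and chosen only for this illustration.

```python
def pm2(sign: int, i: int, j: int, n_nodes: int) -> int:
    """Evaluate PM2_{+i}(j) (sign = +1) or PM2_{-i}(j) (sign = -1)."""
    return (j + sign * (1 << i)) % n_nodes


def distinct_pm2i(n_nodes: int) -> int:
    """Count the PM2I functions that differ as mappings on 0..N-1."""
    n = n_nodes.bit_length() - 1  # n = log2(N) for N a power of two
    tables = set()
    for i in range(n):
        for sign in (+1, -1):
            tables.add(tuple(pm2(sign, i, j, n_nodes) for j in range(n_nodes)))
    return len(tables)


if __name__ == "__main__":
    print(distinct_pm2i(8))
```

For N = 8 there are 2n = 6 function names, but the two functions for i = n - 1 produce the same mapping, so only 5 distinct mappings remain.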

A number of systems have been proposed and/or built that use multistage
networks (e.g., [7, 8, 24, 34, 41]). Among the networks that have been
proposed are the ADM [38], baseline [48], binary n-cube [30], data
manipulator [15], Gamma [29], Generalized Cube [39], inverse ADM (IADM)
[27], omega [22], STARAN flip [9], and SW-banyan [19]. Studies have
shown that the baseline, binary n-cube, Generalized Cube, omega, STARAN
flip, and SW-banyan (S = F = 2) networks are all topologically equivalent
[31, 36, 37, 42, 48]. Differences between these networks are due to proposed
control schemes, whether or not a broadcast capability is included, and the
method used to number input and output ports. All of these networks belong
to the general class of cube-type networks. Because of the similarities among
these networks, a designer is not faced with choosing between six different
networks; rather, the choice is whether or not to use a cube-type network.

The data manipulator, ADM, IADM, and Gamma networks are topologically
identical. The differences between these networks are the control
scheme, order in which stages are traversed, and switch complexity. The
switches in each stage of the data manipulator are divided into two groups.
Each group receives an independent set of control signals and all switches in
a group respond identically. Each switching element of the ADM, IADM,
and Gamma networks is controlled individually. The stages of the IADM and
Gamma networks are traversed in an order opposite to that of the ADM and
data manipulator. Also, the Gamma network's switching elements are 3 x 3
crossbars (as opposed to selecting one input at a time). One property that these
networks have is that for all nontrivial source/destination pairs (i.e., source
address ≠ destination address) there are multiple paths through the network.
For that reason, none of the networks is a member of the general banyan class
[19].

The capabilities of the Gamma network are a superset of the ADM and
IADM networks. It has been shown in turn that their capabilities are a
superset of all the cube-type networks as well as the data manipulator network
[36, 37, 42]. Data manipulator-type networks, however, are more complex
than cube-type networks.

A  common  feature  of  all  cube-type  networks  is  that  there  is  exactly  one 
path  through  the  network  for  each  source/destination  pair.  This  property 
makes  control  schemes  simple  but  any  single  failure  of  a  link  or  switch  will 
disallow  the  use  of  any  path  requiring  the  failed  component. 
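
The single-path property can be illustrated with a short routing sketch. It assumes the standard bit-by-bit scheme in which stages are traversed from n - 1 down to 0 and the box at stage i is set to exchange exactly when the current line label and the destination differ in bit i. This scheme and the name `cube_route` are assumptions of the sketch; the paper itself only cites routing tag schemes without detailing them.

```python
def cube_route(src: int, dst: int, n_stages: int) -> list[tuple[int, str, int]]:
    """Return (stage, box setting, line label leaving the stage) for each stage.

    Assumes stages are traversed in the order n-1, ..., 1, 0.
    """
    path = []
    label = src
    for i in reversed(range(n_stages)):
        if ((label ^ dst) >> i) & 1:
            label ^= 1 << i          # cross to the paired line cube_i(label)
            setting = "exchange"
        else:
            setting = "straight"
        path.append((i, setting, label))
    return path


if __name__ == "__main__":
    for hop in cube_route(2, 6, 3):
        print(hop)
```

Because each stage offers exactly one admissible setting for a given source and destination, the path is fully determined, which is also why a single failed box or link on that path cannot be bypassed.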

Thus  there  exists  the  classic  trade-off  between  cost  and  performance  when 
choosing  between  the  two  network  types.  In  this  paper,  the  network  types  are 
compared,  using  one  representative  network  from  each  type:  the  Generalized 
Cube  and  the  ADM.  Both  networks  have  the  same  number  of  input  and  output 
ports  and  individual  switching  element  control.  Routing  tag  schemes  are 
available  for  the  networks  [22,  28,  38,  39],  so  it  is  assumed  that  they  are  used 
to  implement  network  control. 

Some  aspects  of  the  Generalized  Cube  and  the  ADM  networks  have  been 
compared  elsewhere.  The  ability  of  the  ADM  network  to  perform  all  the 
functions a Generalized Cube can was demonstrated in [42]. In [1], the total
number of unique permutation connections each network can perform was
compared.  In  [5],  graph  models  were  used  to  study  multistage  interconnection 
networks  which  have  the  “buddy  property”  (cube-type  networks  have  that 
property)  and  other  networks  including  the  ADM.  In  that  paper  emphasis  was 
on  comparing  the  networks'  permutation  capabilities.  This  paper  is  concerned 
with  comparing  cost  and  robustness  or  inherent  fault  tolerance.  Cost  is  exam¬ 
ined  from  two  points  of  view.  The  first  is  the  common  method  of  counting 
links  and  switching  nodes.  In  this  case,  the  graph  model  with  a  consistent 
interpretation  (two  are  possible)  is  used  to  ensure  a  “fair”  comparison.  The 
second  point  of  view  is  oriented  toward  VLSI  considerations.  Modules  for 
each  network  requiring  roughly  the  same  number  of  pins  are  compared.  The 
change  in  relative  cost  is  also  examined  when  as  much  as  one  whole  stage  is 
placed  on  one  chip.  Robustness  is  measured  by  calculating  the  average  num¬ 
ber  of  network  inputs  and  outputs  affected  by  the  removal  of  a  single  link  or 
switching  element.  The  calculations  are  performed  for  both  of  the  graph 
interpretations  to  be  defined. 


III.  Graph  Modeling:  A  Common  Basis  for  Comparing  Networks 

Graph models have been used by Goke and Lipovski [19] as the basis for
defining a class of networks called banyans. The graphs used to represent
these networks consist of nodes connected by directed arcs. By definition, in
a banyan there is one and only one path from input to output [19]. In this paper
the arcs are undirected and there is no restriction on the number of paths from
input to output.

Fig. 3. Graphical representation of the Generalized Cube network for N = 8.

It has been observed [20, 23] that the Generalized Cube network (Fig. 1)
has  the  graphical  representation  shown  in  Fig.  3.  This  graph  also  represents 
an  SW-banyan  (with  S  =  F  =  2).  The  graph  can  be  interpreted  a  number  of 
different  ways.  One  is  to  treat  each  node  (vertex)  (a  circle  in  the  figure)  as  a 
switch  and  each  arc  (edge)  (a  line  in  the  figure)  as  a  link.  To  model  the 
network’s  behavior  under  this  interpretation,  the  switch  (node)  shown  in  Fig. 
4a  should  only  connect  one  of  the  input  links,  a  or  b,  to  one  of  the  output 
links,  c  or  d.  An  implementation  based  on  this  interpretation,  for  an  N 
input/output  network,  would  consist  of  n  +  1  stages  of  N  switches,  with 
lines between stages. The TRAC reconfigurable, multimicroprocessor system
contains an SW-banyan constructed from switches of this type (but that have
two incoming and three outgoing links, i.e., S = 2 and F = 3) [32].

A second interpretation of the graph in Fig. 3 is to treat the nodes as links
and the arcs as forming interchange boxes. For example, the thickened lines
in Fig. 3 can be considered to represent the interchange box with inputs 2 and
6 (compare this to Fig. 1). In this case the SW-banyan implementation would
have the same structure as specified here for the Generalized Cube (assuming
a bidirectional network). This interpretation is illustrated in Figs. 4b and c.
Each of the arcs labeled a through d in Fig. 4b acts as a crosspoint switch in
Fig. 4c. When viewed this way, the portion of the graph within the dashed
lines of Fig. 4b behaves as a 2 x 2 crossbar or interchange box. If a and d
are "on," the straight setting is obtained; b and c "on" corresponds to
exchange; a and b "on" corresponds to upper broadcast; and c and d "on"
corresponds to lower broadcast. Conflict occurs if a and c or b and d are on
at the same time. It will be shown in the next section that implementations
based on the first and second graph interpretations are functionally equivalent.

Fig. 4. (a) A node from the graph representing the Generalized Cube network. When equated
with a switch, input a or b can be connected to output c or d. (b) Four nodes from the graph.
When the arcs a, b, c, and d are equated with switches, a 2 x 2 crossbar is obtained. (c) The
components of a crossbar that correspond to the graph in (b).
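
The crosspoint settings listed above can be tabulated in a few lines of Python. The mapping of each crosspoint to an input/output pair is inferred here from the stated settings (a and b share the upper input, a and c share the upper output) and is an assumption of this sketch, as are the names `CROSSPOINTS` and `box_setting`.

```python
# Inferred crosspoint wiring: crosspoint -> (input it taps, output it drives).
CROSSPOINTS = {
    "a": ("upper_in", "upper_out"),
    "b": ("upper_in", "lower_out"),
    "c": ("lower_in", "upper_out"),
    "d": ("lower_in", "lower_out"),
}

SETTINGS = {
    frozenset("ad"): "straight",
    frozenset("bc"): "exchange",
    frozenset("ab"): "upper broadcast",
    frozenset("cd"): "lower broadcast",
}


def box_setting(on: str) -> str:
    """Classify the set of crosspoints that are switched on."""
    active = frozenset(on)
    outputs = [CROSSPOINTS[x][1] for x in active]
    if len(outputs) != len(set(outputs)):
        return "conflict"            # two inputs drive the same output
    return SETTINGS.get(active, "partial or idle")


if __name__ == "__main__":
    for combo in ("ad", "bc", "ab", "cd", "ac", "bd"):
        print(combo, "->", box_setting(combo))
```

Under this wiring, the two conflict cases named in the text (a with c, b with d) are exactly the combinations in which both inputs drive one output.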

A third possible interpretation of the graph in Fig. 3 is to equate nodes with
2 x 2 interchange boxes and arcs with links. In that case, Fig. 3 would
represent a size N = 16 Generalized Cube network. This interpretation will
not be discussed further in this paper.


Fig. 5. Graphical representation of the Augmented Data Manipulator for N = 8.


The graphical representation of the ADM network (Fig. 2) is shown in Fig.
5. Since there are multiple paths from input to output, this is not a banyan
graph. This graph can be obtained by adding the dashed lines shown in Fig.
5 to the graph in Fig. 3.¹ When switches are equated with nodes, the network
depicted in Fig. 2 is obtained. When switches are equated with arcs, the
network looks like that shown in Fig. 6. In the figure, two nodes directly
connected by a solid line between stages are represented by a single node in
Fig. 3. Note that the labels on end-around connections in both Fig. 5 and Fig.
6 are attached to the same arcs (links) in the network. This second type of
implementation is examined in [43], where LSI packaging of network building
blocks is discussed.

¹We first published this observation in October 1982, in the Proceedings of the Third
International Conference on Distributed Computing Systems, in a preliminary version of this
material. It was also discovered independently and published in [5].

Fig. 6. Implementation of the Augmented Data Manipulator for N = 8 when the graph of
Fig. 5 is interpreted with arcs equated to switches.

Though the same ADM network is represented, Figs. 2 and 6 look rather
different. Depending upon which representation is chosen, a comparison with
the Generalized Cube in Fig. 1 could produce different conclusions. Comparing
Figs. 1 and 2, one might conclude that, in addition to having an extra
column of switches, the ADM has twice as many switching nodes and three
times as many links as the Generalized Cube network. It would be easy to
decide that the ADM network is considerably more expensive. On the other
hand, comparing Figs. 1 and 6, it appears the only difference is N extra links
that interconnect switches within each stage of the ADM network. The latter
comparison is more accurate because the network depictions of Figs. 1 and
6 are based on the same interpretation of the networks' respective graphs.
Thus when making comparisons, it is important to compare either graphical
representations or consistent interpretations of those graphs. In the next
section, the latter is done for both interpretations, so that the resulting
implementations can be compared as well.


IV. Cost Comparison

A.  Introduction 

The  purpose  of  this  section  is  to  compare  the  cost  of  the  Generalized  Cube 
network to that of the ADM network. To do this, implementations of each
network are examined. Two different criteria are used in the comparison.
First,  hardware  requirements  are  examined.  Since  two  basic  implementations 
are  possible  for  each  network,  to  be  fair,  only  implementations  corresponding 
to  the  same  graph  interpretation  are  compared.  Then,  since  VLSI  imple¬ 
mentation  is  being  considered,  the  total  number  of  data  pins  available  on  a 
chip  is  held  constant  and  chip  counts  are  compared  for  all  the  different 
implementations.  It  would  be  desirable  to  compare  the  gate  densities  required 
for  each  chip;  however,  that  requires  having  a  detailed  design  for  each.  In  lieu 
of  such  details,  the  attempt  was  made  to  compare  chips  with  comparable 
major architectural features (e.g., queues) which presumably require the
same  amount  of  logic  and  which  can  be  compared  at  a  gross  level. 

Although  the  discussion  presented  here  is  in  terms  of  integrated  circuit 
chips,  it  is  not  restricted  to  any  particular  technology.  It  is  only  presumed  that 
a  network  is  constructed  from  modular  elements  with  I/O  facilities  (ports) 
proportional  to  that  portion  of  the  network  graph  (with  an  appropriate  inter¬ 
pretation)  intersected  by  the  boundary  of  the  module.  For  example,  in  the 
future,  an  I/O  port  may  consist  of  a  laser  diode  and  a  single  optical  fiber 
instead  of  many  parallel  wires. 

B.  Hardware  Realizations 

There  are  two  basic  ways  to  implement  multistage  networks.  They  can  be 
circuit  switched  or  packet  switched.  In  circuit  switching,  a  complete  path  is 
established  from  input  to  output  and  must  be  held  for  the  duration  of  the 
communication.  Circuit  switching  is  often  used  when  processors  are  con¬ 
nected  to  the  network  inputs  and  memories  are  connected  to  the  outputs. 
Designs for circuit-switched interchange boxes have been discussed in [11,
26, 43]. In packet switching, messages are decomposed into packets which
each make their way from stage to stage until the output is reached. This
method is often used in configurations that connect processing element
(processor/memory pair) j to input j and output j of a unidirectional network.
Packet-switched network switching element designs have been discussed in
[14, 26, 46].

In  the  remainder  of  this  paper,  implementations  will  be  discussed  primarily 
in  terms  of  packet  switching.  Circuit-switched  versions  can  be  obtained  by 
replacing  any  queues  shown  with  buses.  Other  than  this,  remaining  differ¬ 
ences  are  in  the  control  logic;  however,  the  logic  is  shown  only  at  the  block 
diagram  level.  Only  key  elements  of  the  implementations  to  be  discussed  are 
included  since  many  variations  of  the  basic  designs  are  possible.  For  more 
detail see [14, 26, 46].

C.  Generalized  Cube 

Figure 7 shows two designs for a Generalized Cube switching element.
Figure 7a results when switches are equated with nodes in the graph (this
corresponds to Figs. 4a and 3). One of the two inputs is selected depending
on the requests (if any) received by the (left half of the) control logic, which
handles any needed arbitration. A single output link is shown, but it is to be
connected to two other switches as shown in Fig. 8. A bit in the routing tag
is examined by the control logic, which then determines to which switch a
request for access should be made. The (right half of the) control logic
maintains the queue, interprets the routing tag, generates access requests, and
receives grants for access requests. Switches that implement nodes in column
3 of Fig. 3 only contain hardware to the right of the dashed line in Fig. 7a.
Switches that implement column 0 nodes only contain hardware to the left of
the dashed line. A detailed design of this type is discussed in [32].

Fig. 7. Implementation of Generalized Cube switches. (a) Node = switch interpretation. (b)
Arc = switch interpretation.

If arcs in the graph are equated with switches, then four arcs form a 2 x 2
crossbar or interchange box (see Figs. 4b and c and 1). An implementation
for this is shown in Fig. 7b. Here two input queues are required. As long as
a given queue is not full, incoming packets for that queue will be accepted.
Logic is required to handshake with other interchange boxes, maintain two
queues, and interpret the routing tags at the head of each queue. This logic
only interprets the tags in order to request the desired settings for the
multiplexers. Logic associated with the multiplexers performs any necessary
arbitration. It also makes appropriate requests of other interchange boxes once the
multiplexers are set. Different protocols and design variations for this type of
switching element are discussed in [26]. The performance of networks
implemented with these interchange boxes has been studied in [3, 14, 25].

Fig. 8. Four switches from Fig. 7a combined to form one switch (within dashed lines)
equivalent to that in Fig. 7b.

The equivalence of two networks implemented with the two kinds of
switching nodes is illustrated in Fig. 8. Four of the switching elements shown
in Fig. 7a are connected as prescribed by the graph in Fig. 3. It can be seen
that the hardware within the dashed lines is identical to that shown for the
interchange box in Fig. 7b. The handshaking lines (directed dashed lines)
shown connecting control units are equivalent to internal connections
between the tag interpretation and queue control logic and the arbitration and
output request logic in the control unit of Fig. 7b. It is thus apparent that the
same total amount of hardware is required for either implementation, but that
the two graph interpretations lead to different network building blocks or
packagings for the components.

Fig. 9. Implementation of Augmented Data Manipulator switches. (a) Node = switch
interpretation. (b) Arc = switch interpretation.

D.  Augmented  Data  Manipulator 

Two implementations for the ADM network are shown in Fig. 9. Figure
9a results from equating the nodes of Fig. 5 with switches. In this design, the
multiplexer selects from among three input links and the output link is
connected to three other switches. The control signals shown on the output side
in Fig. 9a are used to determine which of the switches is to read the data from
the output link. A broadcast is performed by selecting more than one switch.
The basic routing tag scheme for the ADM network [28] requires the routing
tag logic to examine two bits, so it is slightly more complex than that required
in the Generalized Cube. As with the Generalized Cube, the switches
implementing nodes in columns 0 and 3 of Fig. 5 only require the logic to the left
and right, respectively, of the dashed line in Fig. 9a. This was also observed
in [15].

If arcs are equated with switches, an implementation similar to the
interchange box is obtained as shown in Fig. 9b. Here, however, the outputs from
the queues must be connected to multiplexers in two other switching elements
(as shown in Fig. 6) via intrastage buses. Similarly, the two multiplexers
shown here must accept connections from the queues of two other switching
elements. Two control signals must also accompany each of the intrastage
buses.

E. Comparison

An approximate cost comparison between the Generalized Cube and the
ADM network can be made by comparing their respective switching
elements. Since the choice is arbitrary, Figs. 7a and 9a will be compared. Both
require a single queue. If the cost of the queue and its associated control logic
dominates the cost of the switching element, then the ADM switch will cost
only slightly more than a Generalized Cube switch in a discrete
implementation. On the other hand, for a circuit-switched implementation, the
multiplexer and control logic in an ADM switching element will cost about
50% more than that required in a Generalized Cube switching element.

The perspective changes somewhat when implementing these four designs
in VLSI is considered. Input/output (I/O) requirements and logic/pin ratio
become important considerations. For constructing a Generalized Cube
network, the interchange box in Fig. 7b is a better choice than the switch in Fig.
7a. The interchange box (Fig. 7b) has more pins but
approximately 100% more logic than the switch (Fig. 7a). For the ADM
network, the logic/pin ratio is nearly the same for both of the designs in Fig.
9. The design in Fig. 9b has approximately twice as many pins and twice as
much logic as that shown in Fig. 9a. The extra links that give the ADM
network its superior capabilities over the Generalized Cube require a larger
number of pins on the VLSI chips being considered.

The design of Fig. 7b and that of 9a have approximately the same number
of pins. If this number of pins (due to the data path width) is near
technological limits (and thus the design of Fig. 9b will not fit on one chip), then
the Generalized Cube interchange box is superior due to the logic/pin ratio.
Assuming the cost of two chips with the same number of pins is about the
same, an ADM network would be more than twice as expensive as a
Generalized Cube network of the same size (when realized with these two
respective chips). The logic/pin ratio of the ADM chip (Fig. 9a) can be
improved considerably by implementing extra capabilities the ADM network is
known to support [27, 28]. These capabilities include dynamic rerouting of
blocked messages and stage look-ahead with rerouting for blockage
prediction. None of the additional features requires any extra pins. The additional
capabilities are possible because of the extra paths between input and output
and thus are not available for the Generalized Cube network.

The  cost  difference  between  the  two  implementations  of  each  network  due 
to pin limitations can be further quantified. Assume that one switching element
is implemented on one chip and that the chips are bit sliced. For the sake
of  modularity,  in  the  node-equals-switch  implementation  of  both  networks, 
this  means  the  chip  will  be  more  complex  than  necessary  for  the  switching 
elements  in  the  input  and  output  columns  (3  and  0  in  Figs.  3  and  5). 

Let D be the number of pins available on the chip for data path connections
and P be the number of I/O ports required by the switching element (see Figs.
7 and 9). Then D/P is the number of pins available per port. It is assumed
that data pins dominate the total pin count and that the chip has the
capacity to accommodate the small number of control and power pins also
needed. If the network data path width is W, then W · P/D is the number of
chips required to construct one switching element. Multiplying this by the
number of switching elements needed to implement the network gives the total
chip count. The expressions for the chip count for the four implementations
as a function of W, D, N, and n are given in Table I. A crossbar is included
for comparison. The arc-equals-switch (interchange box) implementation of the
Generalized Cube gives the lowest count regardless of the values of W, D, and
N. As an example of the number of chips required in networks of size N = 16
and N = 64, assume the network path width is W = 32 bits and there are a
total of D = 64 pins available on the chip for data connections. The
resulting counts are shown in Table I.
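
The chip-count arithmetic described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the function
names are hypothetical, rounding W · P/D up to a whole chip is an assumption
(the text gives the ratio without saying how fractions are handled), and the
values P = 4 and (N/2) · n boxes describe a generic two-input, two-output
interchange box network rather than figures read from Table I.

```python
from math import ceil


def pins_per_port(data_pins: int, ports: int) -> float:
    """D/P: data pins available to each I/O port of one chip."""
    return data_pins / ports


def chips_per_element(width: int, ports: int, data_pins: int) -> int:
    """W * P / D chips per switching element (bit-sliced).

    Rounding up is an assumption of this sketch.
    """
    return ceil(width * ports / data_pins)


def total_chips(width: int, ports: int, data_pins: int, elements: int) -> int:
    """Chips per element times the number of switching elements."""
    return chips_per_element(width, ports, data_pins) * elements


if __name__ == "__main__":
    W, D = 32, 64          # path width and data pins from the text's example
    P = 4                  # assumed: 2 inputs + 2 outputs per interchange box
    for N, n in ((16, 4), (64, 6)):
        boxes = (N // 2) * n   # assumed: N/2 boxes in each of n stages
        print(N, pins_per_port(D, P), total_chips(W, P, D, boxes))
```

Changing P or the element count to match a different switching element gives
the corresponding count for that implementation.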

TABLE I

As advances in packaging technology continue, the cost difference between
the Generalized Cube and the ADM will narrow considerably until the
ADM is more cost effective. To see this, examine Fig. 6. The larger the
number of switching elements (of the type in Fig. 9b), in the same stage, that
can be placed on a single chip, the more intrastage buses can be internalized.
This reduces the I/O overhead of the extra links. If a whole stage can be
placed on one chip, then the ADM network requires the same number of chips
and connections between chips as the Generalized Cube network. The
assumption here is that the chip circuit density is not sufficient to support
a crossbar but it will accommodate more logic than one stage of a Generalized
Cube requires. The ADM network's structure thus fills a gap between the
cube-type networks and crossbars. Until very large portions of an
interconnection network can be placed on a single chip, it is clear that the
ADM network will be more expensive to implement than the Generalized Cube,
though the difference will continue to decline. Thus, it is important to
determine the networks' cost effectiveness. It has already been pointed out
that the ADM's capabilities are a superset of the Generalized Cube's. Another
factor that is becoming more important as the construction of enormous
systems is considered will be discussed in the next section: robustness or
inherent fault tolerance.


V. ROBUSTNESS: A COMPARISON OF DEGRADATION UNDER COMPONENT FAILURE

In this section, the robustness of each network is measured by removing a
single component (link or switch) and counting the number of input and
output ports that are affected. An input port is considered affected if it
cannot send a message to all output ports. An output port is considered
affected if there is at least one input port from which it cannot receive
messages. Since the number of ports affected varies with the location of the
removed component, averages are computed. Calculating the averages for the
Generalized Cube network is relatively straightforward. Calculating the
averages for the ADM network is complicated considerably by the varying
numbers of multiple paths between ports of the network. Extensive use of and
extension to the theoretical results in [28] were required to obtain the
closed form solutions presented here. To streamline this presentation,
however, most of the mathematical derivations appear in the Appendix.

The average number of affected ports is calculated for both
implementations of each network. These calculations are performed using two
different rules for counting affected ports. The first rule requires all I/O
ports to be considered. Under this rule, it has been shown that some
permutation connections can be routed around a faulty link in the ADM
network, but this is not true in general [38].

The second rule allows "severely" affected ports to be disabled and thus not
included in the count. It is implemented as follows. Referring to the graphs
in Figs. 3 and 5, if a straight (or horizontal) arc at level j is removed, then
input port j and output port j are disabled. If links are equated with arcs, one
pair of I/O ports is disabled. If switches are equated with arcs, since two
straight arcs are included in each switching element (Figs. 7b and 9b), two
pairs of I/O ports are disabled. Thus, in Figs. 1 and 6, the I/O ports whose
addresses correspond to the output labels on a given switching element are
disabled if that switching element fails.

This second rule takes into account a practical system response to a network
fault: the disabling of some components so that operation can continue, but
in a degraded mode. This is feasible if the network is used for asynchronous
communication by cooperating processors (MIMD mode [18]). If the network
is used in a synchronous mode to establish permutation connections (SIMD
mode [18]), disabling some components is not feasible. However, if the
system is partitionable so that subsets of the processors, called submachines,
operate synchronously but independent of other submachines, then certain
submachines can be disabled when a fault occurs. PASM [41] and TRAC [34]
are systems with this capability.

Since robustness is useless unless it can be exploited, it is implicit that
faults can be detected and diagnosed and that the system can continue to
function once a fault is detected. Detection and diagnosis have been
investigated in [4, 17, 33, 39]. The latter requirement implies that measuring
robustness is only meaningful for MIMD and partitionable SIMD
environments.

The results using the first rule are shown in Table II and those using the
second rule in Table III. Two examples of how to calculate the expressions
in Table II are shown in the next subsection. The remaining derivations for
both tables are presented in the Appendix. It should be noted that the entries
in both tables under the "Node = Switch" column for a switch failure and
under the "Arc = Switch" column for a link failure are identical. This is
because both situations correspond to removing a single node from the
graphical representation. Removing a link from the node-equals-switch
implementation corresponds to removing a single arc from the graph, whereas
removing an interchange box from the arc-equals-switch implementation
corresponds to removing four or eight arcs from the graph, considerably
different situations.

TABLE III

B. Fault Effect Analysis Counting All Affected Ports

Here the effects of a link fault are analyzed in detail for the Generalized
Cube network and then for the ADM network. For the former, assume there
is a link failure in stage i, the first rule applies, and the network is
implemented by equating nodes with switches (Fig. 3). To see which inputs are
affected, start with the failed link and move backward toward the input,
tracing all links that are connected to the failed one. The number of affected
inputs corresponds to the number of traced links in stage n - 1. In general
this number is 2^{n-i-1}. For example, if the link at level 4, stage 1
(Fig. 3), fails, inputs 0 and 4 are affected. The number of affected outputs
is calculated by tracing links from the failed one to those to which it is
connected in the last stage, i.e., stage 0. This number is expressed as 2^i.
For example, failure of the link at level 6, stage 2 (Fig. 3), affects
outputs 4, 5, 6, and 7. To calculate the average number of affected I/O
ports, given a single link failure, a sum of these two terms taken over all
stages is computed:

\[
\frac{1}{n}\sum_{i=0}^{n-1}\left(2^{n-i-1}+2^{i}\right)
  = \frac{2}{n}\sum_{i=0}^{n-1}2^{i}
  = \frac{2(N-1)}{n}.
\]
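
The tracing argument for the Generalized Cube can be checked by brute force on
small networks. The following is a minimal sketch, not the authors' method: it
assumes the usual graph in which the node at level u entering stage i has a
straight arc to level u and a cube arc to level u XOR 2^i, and it identifies a
link by the hypothetical triple (stage, from_level, to_level).

```python
from fractions import Fraction


def reachable_outputs(n, src, failed):
    """Outputs input `src` can still reach when link `failed` is removed.

    A link is (stage, from_level, to_level); stages run n-1 down to 0.
    """
    frontier = {src}
    for i in range(n - 1, -1, -1):
        nxt = set()
        for u in frontier:
            for v in (u, u ^ (1 << i)):      # straight arc, cube_i arc
                if (i, u, v) != failed:
                    nxt.add(v)
        frontier = nxt
    return frontier


def affected_ports(n, failed):
    """Return (affected inputs, affected outputs) under the first rule."""
    N = 1 << n
    reach = [reachable_outputs(n, s, failed) for s in range(N)]
    inputs = {s for s in range(N) if len(reach[s]) < N}
    outputs = {d for d in range(N) if any(d not in r for r in reach)}
    return inputs, outputs


def average_affected(n):
    """Average number of affected I/O ports over all single-link failures."""
    N = 1 << n
    links = [(i, u, v) for i in range(n) for u in range(N)
             for v in (u, u ^ (1 << i))]
    total = sum(len(a) + len(b)
                for a, b in (affected_ports(n, f) for f in links))
    return Fraction(total, len(links))


if __name__ == "__main__":
    print(affected_ports(3, (1, 4, 4)))    # link at level 4, stage 1
    print(affected_ports(3, (2, 6, 6)))    # link at level 6, stage 2
    print(average_affected(3), Fraction(2 * (8 - 1), 3))
```

For N = 8 the two printed cases correspond to the two examples in the text, and
the last line compares the brute-force average with 2(N - 1)/n.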
As a second example, consider the case of a link failure in the ADM
network, using the first rule, and implemented by equating nodes with
switches (see Fig. 2). A property of the ADM network is that there are at least
two paths between every nontrivial (input address ≠ output address)
input/output pair [28]. One of the existing paths consists of plus and straight
links only and is called positive dominant. There is another path that consists
of minus and straight links only and it is called negative dominant. The
portions of the positive and negative dominant paths that are distinct depend
on the relationship between the addresses. If they agree in the low-order i + 1
bits, then the paths converge at the input to stage i and follow the same set
of straight links in stages i through 0. (The paths will be distinct in stages
n - 1 through i + 1.) Thus if a nonstraight link fails, none of the I/O ports
are affected because there will be a distinct path of the opposite dominance
that avoids that link. (Routing schemes have been proposed that allow
messages to dynamically switch between positive and negative dominant paths as
they traverse the network [27, 28], allowing them to avoid busy or faulty links
and switches.) If a straight link in stage i at level j fails, then all the
input ports whose low-order i + 1 bits agree with output port j's low-order
i + 1 bits will not be able to send a message to output j. There are
2^{n-i-1} such input ports. The other input ports can communicate with output
port j since their paths to j do not converge until reaching a stage less
than i. No output port other than j is affected by the failure. To see this,
consider output port k ≠ j. All input ports must be able to communicate with
k. They can be divided into two classes: (1) those whose addresses agree with
k's in fewer than i + 1 low-order bits; and (2) those whose addresses agree
with k's in at least i + 1 low-order bits. In the first case, either a given
path from the input to output k does not include the faulty straight link (in
stage i, level j) or if it does, there is another path of opposite dominance
that does not. In the second case, in stages i through 0 the required path
uses straight links; however, they are all at level k. Thus all inputs can
communicate with output k so k is unaffected. As an example, suppose the
straight link in stage 1, level 4 (in Fig. 2), is bad (i = 1, j = 4). Consider
three different situations: communication from inputs 0 and 4 to output 4,
from input 0 to output 5, and from input 1 to output 5. Output 4 will be
unable to receive messages from inputs 0 and 4, since 0 and 4 agree with j in
the low-order i + 1 bits (2 bits). All the other output ports are unaffected.
Consider connecting input 0 to output 5. Even though the positive dominant
path, +2^2, straight, +2^0, from input 0 to output 5 includes the bad straight
link, a message can simply take the negative dominant path, straight, -2^1,
-2^0. Input 1 agrees with output 5 in the two low-order bits (bits 0 and 1)
and therefore requires straight links in stages 0 and 1. However, the required
links are at level 5, and thus the faulty straight link is not required.
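
The same kind of exhaustive check can be applied to the ADM example. The sketch
below is illustrative only: it assumes that the node at level u entering stage
i has a straight link to level u and links to levels (u + 2^i) mod N and
(u - 2^i) mod N, and it names a link by the hypothetical triple
(stage, from_level, kind), where kind is 0, +1, or -1.

```python
def adm_reachable(n, src, failed):
    """Outputs reachable from input `src` with link `failed` removed.

    A link is (stage, from_level, kind); kind is 0 (straight),
    +1 (the +2^stage link) or -1 (the -2^stage link).
    """
    N = 1 << n
    frontier = {src}
    for i in range(n - 1, -1, -1):
        nxt = set()
        for u in frontier:
            for kind in (-1, 0, 1):
                if (i, u, kind) != failed:
                    nxt.add((u + kind * (1 << i)) % N)
        frontier = nxt
    return frontier


def adm_affected(n, failed):
    """Return (affected inputs, affected outputs) under the first rule."""
    N = 1 << n
    reach = [adm_reachable(n, s, failed) for s in range(N)]
    inputs = {s for s in range(N) if len(reach[s]) < N}
    outputs = {d for d in range(N) if any(d not in r for r in reach)}
    return inputs, outputs


if __name__ == "__main__":
    n = 3
    print(adm_affected(n, (1, 4, 0)))     # straight link, stage 1, level 4
    print(adm_affected(n, (1, 4, 1)))     # the +2^1 link leaving level 4
    for i in range(n):
        ins, outs = adm_affected(n, (i, 4, 0))
        print(i, len(ins), 2 ** (n - i - 1), outs)
```

In the N = 8 case the first call models the bad straight link of the example
(i = 1, j = 4), and the loop compares the number of affected inputs in each
stage with 2^{n-i-1}.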

The average number of I/O ports affected by a bad straight link, under the
first rule, is calculated by adding the number of affected input and output
ports (as a function of the stage in which the fault is located) and summing
over all stages:

\[
\frac{1}{n}\sum_{i=0}^{n-1}\left(2^{n-i-1}+1\right)
  = \frac{1}{n}\sum_{i=0}^{n-1}\left(2^{i}+1\right)
  = \frac{N-1}{n}+1.
\]

Since the failure of a +2^i or a -2^i link does not affect any I/O ports, if
link failures are equally likely, then the average over all links is
one-third of the above value.

In Table II, the ratio of the average number of affected I/O ports in the
Generalized Cube to those in the ADM is computed. Regardless of network
size, in the node-equals-switch implementation, a link failure in the
Generalized Cube network affects six times as many ports, on the average, as a
link failure in the ADM. A given switch failure affects twice as many ports.
In the arc-equals-switch implementation, a link failure in the Generalized
Cube network affects twice as many ports as the same failure in the ADM
network. An interchange box failure affects 1.14 times as many ports.


C.  Discussion  of  Fault  Effects  with  Some  Disabled  Ports 

The measurement using the first rule is a very conservative indication of the
robustness of the ADM network. Table III shows that under the second rule,
the ADM network is very robust. When the pair of I/O ports connected to the
network at the level of the failure is disabled in the node-equals-switch
implementation, none of the remaining ports is affected by a link or a switch
failure. A failure can only eliminate one of at least two paths that are always
available between the enabled I/O ports (as illustrated in the example above
connecting input 0 to output 5). In the arc-equals-switch implementation, link
failures have no effect on enabled I/O ports, because (as pointed out earlier)
the situation is equivalent to removing a switch in the node-equals-switch
implementation. However, "interchange box" failures do affect some enabled
ports. The reason is that there are situations in which both paths
between input and output ports pass through the same box. It so happens that
these situations only occur in networks larger than size N = 8. The full
details are presented in the Appendix.
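
For the node-equals-switch ADM implementation, the claim that a switch failure
leaves every enabled port unaffected under the second rule can be tested
exhaustively on small networks. This is a sketch under assumptions, not the
derivation used in the paper: columns are numbered n (input side) down to 0
(output side), the ADM links are taken to be straight, +2^i, and -2^i
(mod N), and the function names are hypothetical.

```python
def reach_avoiding_switch(n, src, col, level):
    """Outputs reachable from `src` when the switch at (col, level) is dead.

    Stage i carries messages from column i + 1 to column i.
    """
    N = 1 << n
    if col == n and src == level:
        return set()                       # the failed switch is the entry point
    frontier = {src}
    for i in range(n - 1, -1, -1):
        frontier = {(u + k * (1 << i)) % N
                    for u in frontier for k in (-1, 0, 1)}
        if i == col:
            frontier.discard(level)        # no path may pass through it
    return frontier


def enabled_ports_unaffected(n, col, level):
    """True if every enabled input still reaches every enabled output."""
    N = 1 << n
    enabled = set(range(N)) - {level}      # rule 2: disable input/output `level`
    return all(enabled <= reach_avoiding_switch(n, s, col, level)
               for s in enabled)


if __name__ == "__main__":
    n = 3
    print(all(enabled_ports_unaffected(n, c, j)
              for c in range(n + 1) for j in range(1 << n)))
```

The check covers switch failures only; a failed straight link at level j is
the milder case in which a single arc, rather than the whole node, is removed.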

The entries in Table III for the Generalized Cube network are calculated in
a fashion similar to those in Table II. The derivation for each entry is given
in the Appendix.

The "interchange box" implementation fault analysis for the Generalized
Cube and ADM networks considers only the worst case, i.e., the entire box
is faulty. Whether both paths are actually blocked due to a single fault in a
real implementation depends on the nature of the fault. Assume that one
interchange box is implemented on a single integrated circuit chip. If two data
lines internal to the chip and coming from the same input become shorted, this
will have no effect on the other internal data path and it is not necessary to
assume that the whole interchange box has failed. On the other hand, a
mechanical failure could affect enough of the chip to render the entire device
unusable. From a reliability point of view, this analysis shows that the
implementation in Fig. 9a (which corresponds to the network in Fig. 2) is to
be preferred over that in Fig. 9b (which corresponds to the network in Fig.
6). Since the logic/pin ratio is roughly the same for both implementations,
nothing is lost. However, the total component count will be higher, leading
to a less physically compact implementation.

The robustness measures for the ADM network are equally applicable to
all the data manipulator-type networks with individual switching element
control since all the properties used to derive them apply to each of the
networks. Similarly, all the measures for the Generalized Cube network are
applicable to all the cube-type networks that have individual switching
element control.

The results presented here are for the basic cube-type and data
manipulator-type topologies. It should be noted that variations on these
topologies which are more fault tolerant have been proposed [2, 12, 27].

The above analysis assumed that the failure of one component was
independent of the failure of any other component. If all or a large part of
one stage is implemented on a single chip, this assumption may or may not be
valid. If it is not, then the networks can be reanalyzed using the techniques
presented here to account for the new failure pattern exhibited.


VI. CONCLUSIONS

This paper has examined two classes of multistage interconnection
networks for use in parallel/distributed systems: the cube type and the data
manipulator type. This was done by comparing a representative network from
each class: the Generalized Cube and the Augmented Data Manipulator
(ADM). This paper has attempted to quantify the differences in
implementation costs by considering comparable implementation models for both
networks. It was found that a discrete, circuit-switched implementation of the
ADM network costs approximately 50% more than the same type of
implementation of the Generalized Cube network. For discrete, packet switched
implementations, assuming the packet buffer cost dominates, the two
networks cost about the same amount (ADM would be slightly higher). If the
networks are to be constructed from VLSI chips, assuming the networks'
building block chips are to have nearly equal numbers of pins, the ADM
network requires more than twice as many chips as the Generalized Cube.
Both networks can benefit from VLSI implementation. Each can be
partitioned into complex building blocks that have higher logic/pin ratios than
partitions of simple building blocks. Though the ADM building block
requires more I/O ports on a chip than a Generalized Cube building block,
present and future predicted pin capacities are sufficient for the ADM
network's needs. Using bit slicing, arbitrarily wide networks of either type
can be constructed.

Using a graph model as a basis, two quantitative measures of comparative
robustness were applied to the networks assuming they are used in MIMD or
partitioned SIMD environments. Applying the measures to two different
(functionally equivalent) implementations of each network under different
faults it was found that the ADM network is always more robust. Using the
first measure, the Generalized Cube network varied from having 1.14 to 6
times as many affected I/O ports due to a single failure as the ADM network.
Using the second measure, in which some I/O ports are disabled, one
implementation of the ADM network was shown to be able to fully support
communication among the remaining enabled I/O ports.

In summary, a graph model has been used as a basis for quantifying the
differences between cube and data manipulator-type networks. Both
implementation costs and robustness have been compared.

APPENDIX: DERIVATIONS OF ROBUSTNESS RESULTS

The following are derivations of each of the results in Tables II and III
(excluding those already presented in Section V-B): the average number of
affected I/O ports in the Generalized Cube and ADM networks when links or
switches fail. Two different implementations and two different rules for
disabling I/O ports are considered. Recall that a port is affected by a failure
if it cannot send a message to all of the other ports or if it cannot receive a
message from all of the other ports.

I. Rule 1: When a link or a switch fails, no I/O ports are disabled.

A. Implementation: Node = Switch (Arc = Link).

1. A single link fails. This case was considered in Section V of
the text.

2. A single switch fails. Under this implementation there are
n + 1 columns of switches (see Figs. 3 and 5) so summations
in the average are taken from 0 to n.

a. Generalized Cube

Starting at the failed switch, trace links and switches back
to the input to determine the number of affected inputs.
This number is 2^{n-i} if the failed switch is in column i.
Using this same method, but tracing to the output, the
number of affected output ports is 2^i. The average number
affected is thus

\[
\frac{1}{n+1}\sum_{i=0}^{n}\left(2^{n-i}+2^{i}\right)
  = \frac{2}{n+1}\sum_{i=0}^{n}2^{i}
  = \frac{2(2N-1)}{n+1}.
\]
b. Augmented Data Manipulator

The same reasoning discussed in the text (Section V)
applies here. There are 2^{n-i} affected input ports and only
1 affected output port. The average is thus

\[
\frac{1}{n+1}\sum_{i=0}^{n}\left(2^{n-i}+1\right)
  = \frac{2N-1}{n+1}+1.
\]


B. Implementation: Arc = Switch (Node = Link).

1. A single link fails. This case is completely analogous to case
I.A.2 above.

2. A single interchange box fails. The effects of this are
determined by examining Figs. 1 and 6.

a. Generalized Cube

The effects of a failed interchange box in stage i are
determined by tracing both input links to the box back to
the input and the two output links to the output. The
number of affected input ports is 2^{n-i} and output ports is
2^{i+1}. The average number affected is

\[
\frac{1}{n}\sum_{i=0}^{n-1}\left(2^{n-i}+2^{i+1}\right)
  = \frac{4}{n}\sum_{i=0}^{n-1}2^{i}
  = \frac{4(N-1)}{n}.
\]


b. Augmented Data Manipulator

In this implementation, the straight arcs (from Fig. 5) that
are paired at stage i (for the network in Fig. 6) are
p_{n-1} ... p_{i+1} 0 p_{i-1} ... p_0 and
p_{n-1} ... p_{i+1} 1 p_{i-1} ... p_0.
The logical "distance" between these links is 2^i. Thus, if
j = p_{n-1} ... p_{i+1} 0 p_{i-1} ... p_0 is the address of the upper
input to an "interchange" box in stage i, then j + 2^i =
p_{n-1} ... p_{i+1} 1 p_{i-1} ... p_0 is the address of the lower
input. For example, in Fig. 6, the second box from the
top in stage 1 has inputs with addresses 4 and 6. In binary
the addresses are p_2 0 p_0 = 100 and p_2 1 p_0 = 110,
respectively. Notice that each box has two other inputs from
nonstraight links. To consider all of the inputs that
possibly could be affected by the failure of a box with inputs
j and j + 2^i in stage i, trace links backward from each
box input to the input of the network. The easiest way to
do this is to use Fig. 5. Start with the nodes at levels j and
j + 2^i in column i. For the example above, these are
nodes 4 and 6 in column 1. Trace the three links backward
to column i + 1 and mark the appropriate nodes.
Repeat the procedure for each marked node. There will
be 2^{n-i} inputs marked in column n, so this is the upper
bound on the number of affected inputs. For the example,
nodes 0, 2, 4, and 6 are marked in column 3 of Fig. 5.
This translates to inputs with these numbers in Fig. 6.
None of the other inputs can be affected by the failure of
this box because they have no physical connection to it.
All of the marked inputs are affected. This is because any
input whose low-order i + 1 bits match either j (bits
0 p_{i-1} ... p_0) or j + 2^i (bits 1 p_{i-1} ... p_0) will require
straight connections in stages i through 0 at level j or
j + 2^i when these inputs communicate with outputs j or
j + 2^i. They will be forced to use the faulty interchange
box in stage i. Calculation shows that there are 2^{n-i}
addresses that meet this criterion so the number of affected
inputs equals the upper bound.

To determine which outputs are affected requires two
observations: (1) inputs that can reach output j or j + 2^i
of the faulty interchange box can only get to levels in
stage i of the form j + b · 2^i mod N, b any integer, and (2)
regardless of the path taken in stages n - 1 through i,
when the path reaches the output of stage i, it must be less
than a distance of 2^i (i.e., 0 to 2^i - 1) of the destination
D. Observation (1) is a result of the fact that the inputs
agree with j in the i low-order bits. In stages n - 1
through i the smallest increment by which a path can
change levels is 2^i. Thus all the levels it can get to in stage
i agree with j in the i low-order bits. Observation (2) is
a result of the fact that the maximum distance stages
i - 1 through 0 can change a path is
\sum_{k=0}^{i-1} 2^k = 2^i - 1.

Now consider five cases regarding the relationship
between D, j, and j + 2^i. First, note that any interchange
box in stage n - 1 that fails will affect all the outputs
since none of them can receive messages from inputs j
and j + 2^{n-1}. So, assume i < n - 1.

Case 1. D = j. Output j is the only output from the
faulty box less than a distance of 2^i from D; therefore (as
shown in the affected inputs analysis) inputs that agree
with j in the low-order i bits cannot communicate with D.

Case 2. D = j + 2^i. The argument is the same as for
Case 1; thus D is affected.

Case 3. j < D < j + 2^i. The only outputs in stage i
less than 2^i from D are j and j + 2^i. Thus both potential
paths from an input that agrees with j in the low-order i
bits must route through the faulty interchange box, so D
is affected.

Case 4. 0 ≤ D < j (if j = 0 = D see Case 1). If D is
a distance of 2^{i+1} or more from j, it is completely
unaffected because there is no physical path from the faulty
box to D. If D is a distance of less than 2^{i+1} from j, the
only outputs from the faulty box less than 2^i from D are
j - 2^i and j. One of the paths to output j - 2^i comes from
the faulty box. However, it is known that there are at least
two ways to get from an affected source to output j - 2^i
in stage i (which is input j - 2^i of the next stage, stage
i - 1). This follows from the facts that (1) there are at
least two paths between every nonequal network input
and output [28], and (2) the only way to reach network
output j - 2^i from an affected source is to go through
input j - 2^i of stage i - 1 and then "straight" through
the rest of the network. Therefore, there are at least two
physical paths from an affected source to input j - 2^i at
stage i - 1. Thus, every affected input must be able to
communicate with D through the other path to input
j - 2^i in stage i - 1. Therefore, D is unaffected.

Case 5. j + 2^i < D ≤ N - 1 (if j + 2^i = N - 1 =
D see Case 2). This case is completely analogous to Case
4. If D is a distance of 2^{i+1} or more from j + 2^i then it
is completely unaffected. Otherwise, the only outputs in
stage i less than 2^i from D are j + 2^i and j + 2^{i+1}. Thus
D is unaffected.

To summarize, if i = n - 1, two inputs and all N
outputs are affected. For 0 ≤ i < n - 1, the outputs
affected have an address D such that j ≤ D ≤ j + 2^i.
There are 2^i + 1 such outputs. The inputs affected agree
with j in the low-order i bits. There are 2^{n-i} such inputs.
The average number of affected I/O ports is

\[
\frac{1}{n}\left[\sum_{i=0}^{n-2}\left(2^{n-i}+2^{i}+1\right)+N+2\right].
\]
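
The Rule 1 averages in this appendix are finite sums, so they can be checked
numerically. The sketch below is an illustration, not part of the original
derivation: the dictionary labels are hypothetical, the per-column and
per-stage terms are the port counts stated in the text, and the closed form
used for the ADM interchange box, (7N - 8)/(2n) + 1, is worked out here from
those counts rather than recovered from the source.

```python
from fractions import Fraction


def mean(terms):
    """Exact average of a sequence of integers."""
    terms = list(terms)
    return Fraction(sum(terms), len(terms))


def rule1_averages(n):
    """Direct sums of affected inputs + outputs for the Rule 1 cases."""
    N = 1 << n
    return {
        "cube switch": mean(2 ** (n - i) + 2 ** i for i in range(n + 1)),
        "adm switch": mean(2 ** (n - i) + 1 for i in range(n + 1)),
        "cube box": mean(2 ** (n - i) + 2 ** (i + 1) for i in range(n)),
        # stages 0..n-2 follow the general term; stage n-1 affects N + 2 ports
        "adm box": mean([2 ** (n - i) + 2 ** i + 1 for i in range(n - 1)]
                        + [N + 2]),
    }


def rule1_closed_forms(n):
    """Closed forms for the same four cases."""
    N = 1 << n
    return {
        "cube switch": Fraction(2 * (2 * N - 1), n + 1),
        "adm switch": Fraction(2 * N - 1, n + 1) + 1,
        "cube box": Fraction(4 * (N - 1), n),
        "adm box": Fraction(7 * N - 8, 2 * n) + 1,   # derived in this sketch
    }


if __name__ == "__main__":
    for n in (3, 4, 6):
        print(n, rule1_averages(n) == rule1_closed_forms(n))
```

Exact rational arithmetic is used so that the comparison does not depend on
floating-point rounding.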




II. Rule 2: If a straight link or switching element at level j fails, disable
input j and output j. If an interchange box with inputs j and k fails,
disable inputs j and k and outputs j and k.

A. Implementation: Node = Switch (Arc = Link).

1. A single link fails.

a. Generalized Cube

When a straight link in stage i fails there are 2^{n-i-1} - 1
affected inputs and when a nonstraight link fails there are
2^{n-i-1} affected inputs (since no ports are disabled).
Similarly there are 2^i - 1 and 2^i affected outputs,
respectively. The average is thus

\[
\frac{1}{2n}\sum_{i=0}^{n-1}\left[\left(2^{n-i-1}-1+2^{i}-1\right)
  +\left(2^{n-i-1}+2^{i}\right)\right]
  = \frac{1}{n}\sum_{i=0}^{n-1}\left(2^{n-i-1}+2^{i}-1\right)
  = \frac{2N-2}{n}-1.
\]

b. Augmented Data Manipulator

If any straight link at level j fails, the only output some
of the inputs cannot communicate with is output j. Since
it is disabled and it was the only affected output (see the
discussion in Section V), none of the remaining enabled
input or output ports is affected.

2. A single switch fails.

a. Generalized Cube

This case is similar to case I.A.2.a except that there is
one less affected input and one less affected output. Also,
the failure of a switch in column n or 0 has no effect on
any inputs or outputs. This is because when a column n
switch fails, any input other than the one entering that
switch can reach all outputs. Similarly, when a column 0
switch fails, the only output that cannot be reached is the
one attached to the failed switch. The average is thus

\[
\frac{1}{n+1}\sum_{i=1}^{n-1}\left(2^{n-i}+2^{i}-2\right)
  = \frac{2}{n+1}\sum_{i=1}^{n-1}\left(2^{i}-1\right)
  = \frac{2N}{n+1}-2.
\]


b. Augmented Data Manipulator

No input or output links are affected. The reasoning is similar to case
II.A.1.b. If a switch at level j fails, the only affected output port
is that connected to the faulty switch by straight links, namely,
output j.

B.  Implementation:  Arc  =  Switch  (Node  =  Link) 

1.  A  single  link  fails. 

This case is completely analogous to case II.A.2 above.

2.  An  interchange  box  fails 

a.  Generalized  Cube 

This is similar to case I.B.2.a except that two fewer inputs and two
fewer outputs are affected. Also, the failure of an interchange box in
stage $n - 1$ or 0 has no effect on any inputs or outputs. This is
because when a stage $n - 1$ box fails, any input other than those
entering that box can reach all outputs. Similarly, when a stage 0 box
fails, the only outputs that cannot be reached are those attached to
the failed box. The average is thus

$$\frac{1}{n}\sum_{i=1}^{n-2}\left(2^{n-i} + 2^{i+1} - 4\right) = \frac{2}{n}\sum_{i=1}^{n-2}\left(2^{i+1} - 2\right) = \frac{2N}{n} - 4.$$

b.  Augmented  Data  Manipulator 

This is similar to case I.B.2.b except that two fewer inputs and two
fewer outputs are affected. As in case II.B.2.a, the failure of an
interchange box in stage $n - 1$ or 0 has no effect on any inputs or
outputs. Therefore, the average is

$$\frac{1}{n}\sum_{i=1}^{n-2}\left(2^{n-i} + 2^i - 3\right) = \frac{1}{n}\sum_{i=1}^{n-2}\left(3 \cdot 2^i - 3\right) = \frac{3N}{2n} - 3.$$
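
The two interchange-box averages (cases II.B.2.a and II.B.2.b) can be checked numerically. The sketch below assumes the per-stage counts are $2^{n-i} + 2^{i+1} - 4$ for the Generalized Cube and $2^{n-i} + 2^i - 3$ for the Augmented Data Manipulator, summed over stages 1 through $n-2$ and divided by $n$; these summands are a reading of the damaged equations, and the function names are illustrative only.

```python
from fractions import Fraction


def cube_box_average(n: int) -> Fraction:
    """Average affected ports, Generalized Cube box failure (assumed summand)."""
    # Stages n-1 and 0 contribute nothing; stage i contributes 2^(n-i) + 2^(i+1) - 4.
    total = sum(2 ** (n - i) + 2 ** (i + 1) - 4 for i in range(1, n - 1))
    return Fraction(total, n)


def adm_box_average(n: int) -> Fraction:
    """Average affected ports, ADM box failure (assumed summand)."""
    # Stage i contributes 2^(n-i) + 2^i - 3 for 1 <= i <= n-2.
    total = sum(2 ** (n - i) + 2 ** i - 3 for i in range(1, n - 1))
    return Fraction(total, n)


def cube_box_closed_form(n: int) -> Fraction:
    """Closed form 2N/n - 4 with N = 2^n."""
    return Fraction(2 * 2 ** n, n) - 4


def adm_box_closed_form(n: int) -> Fraction:
    """Closed form 3N/(2n) - 3 with N = 2^n."""
    return Fraction(3 * 2 ** n, 2 * n) - 3
```

Exact rational arithmetic is used so that the sums and the closed forms can be compared without rounding error.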
Abstract

A variety of fault-tolerant multistage interconnection networks for
parallel processing systems that have been proposed in the literature
are surveyed. A network is fault-tolerant if it can continue to meet
its fault-tolerance criterion in the presence of one or more failures
of the type(s) allowed by its fault model. Significant differences in
fault models and fault-tolerance criteria exist among various
fault-tolerant networks. This makes direct comparison of these networks
difficult. In analyzing the networks, this paper compares the various
models and assesses the effect of choosing a common model and
criterion. Network characteristics such as degree of fault tolerance,
routing control method, and permutation capability are discussed. The
networks surveyed and compared to the Extra Stage Cube are the Modified
Baseline, Augmented Delta, F-network, Enhanced Inverse Augmented Data
Manipulator, Gamma, Fault-Tolerant Benes, and β-networks.


1.  Introduction 

A number of fault-tolerant multistage interconnection network designs
have been discussed in the literature recently. The interconnection
network is an important component of large-scale parallel and
distributed computer systems since it is the mechanism for information
transfer among the computation nodes and memories. Assuring adequate
reliability for such large systems is a significant task. Thus, a
crucial practical aspect of an interconnection network used to meet
communication needs is fault tolerance.

This paper surveys a number of fault-tolerant multistage
interconnection networks which have appeared in the literature.
Included are the Extra Stage Cube network [1, 2], the Modified Baseline
network [21], the Augmented Delta network [6], the F-network [5], the
Enhanced Inverse Augmented Data Manipulator [11], the Gamma network
[13], the Fault-Tolerant Benes network [4, 17], and β-networks [14].
Only networks with topologies intended to provide fault tolerance are
included because an aim of this paper is to compare these networks with
the Extra Stage Cube network, the fault tolerance of which is a
consequence of its topology. Other methods for enhancing network
reliability, such as using error correcting codes with existing
interconnection networks, have been investigated [10] but are not
considered here.

This research was supported by the U.S. Army Research Office,
Department of the Army, under Contract DAAG29-82-K-0101, and

Basic terminology is defined in Section 2. Section 3 describes the
networks and their characteristics relevant to fault tolerance. The
networks are evaluated in Section 4 and compared to the Extra Stage
Cube in Section


2.  Definitions 

Interconnection networks which can continue, in at least some cases, to
provide service when they contain faulty components are known as
fault-tolerant. A network is termed single fault tolerant if it can
function in spite of a single fault. If up to i faults can be tolerated
then the network is i-fault tolerant. A network will be termed robust
if it can tolerate some instances of i faults, but is not i-fault
tolerant. A fault is hard if it is not of a transient nature. All
faults are assumed hard for the purposes of this paper.

It is only meaningful to speak of a network as i-fault tolerant with
regard to a particular fault-tolerance model. A fault-tolerance model
consists of two components. The first, a fault model, defines the
nature of all faults that are assumed to occur in a network. The fault
model for a given network may or may not correspond closely to actual
or predicted experience with hardware. In particular, fault models are
often chosen to have characteristics suited for performing an analysis,
even if their characteristics do not exactly reflect reality. The
second component is the fault-tolerance criterion, the condition that
must be met for the network to be said to have tolerated a given fault,
or faults. This varies from network to network due to differences in
the definition of what constitutes functionality for a given network
(specifically, what amount of degradation from the fault-free condition
is allowed).

The fault tolerance of a network is determined by various factors,
including the chosen fault model and fault-tolerance criterion. The
choice of fault model, however, is at the discretion of one
investigating the fault tolerance of a network. Different choices can
lead to different claims for the fault tolerance of a network.
Similarly, various choices of fault-tolerance criterion imply different
fault tolerance capabilities. The fault-tolerance model is essential to
understanding and evaluating the fault tolerance capabilities of
networks. Because different fault-tolerance models are used with
different networks, some care must be taken in comparing fault-tolerant
networks.

The networks discussed all perform their interconnection function using
stages of switching elements (switches). The number of stages depends
on the network, as does the design of the switching elements. The
switching elements are connected by links.


3. Network Descriptions and Fault-Tolerance Models

The networks described in this section fall into four general
categories. The Extra Stage Cube, Augmented Delta, Modified Baseline,
and F-network form a group related to the Generalized Cube network
[15, 16]. The first three of these achieve fault tolerance by adding an
extra stage of switches to a base network which is related to the
Generalized Cube. The F-network gains fault tolerance by augmenting a
Generalized Cube network structure with additional links.

The data manipulator [8] class of networks is represented by the
Enhanced Inverse Augmented Data Manipulator (IADM) and Gamma networks.
The Enhanced IADM network uses additional links, and the Gamma network
uses increased switching element complexity to realize fault tolerance.

The Fault-Tolerant Benes network is a third type of network. It has
2n-1 stages of switching elements and an additional switching element
to provide fault tolerance, as compared to the n stages of switches in
a Generalized Cube, where N = 2^n is the number of inputs. The fourth
category of network is represented by β-networks. This family of
networks spans a wide range of topologies and gains fault tolerance
through an operational technique.

Fig. 1 The Extra Stage Cube network with N=8.

3.1  Extra  Stage  Cube 

The Extra Stage Cube (ESC) [1, 2] is formed from the Generalized Cube
by adding an extra stage of switching elements along with a number of
multiplexers and demultiplexers. Thus, the ESC has relatively low
incremental cost over the Generalized Cube network. ESC network
structure is illustrated in Fig. 1 for N=8.

Each stage of the ESC contains N/2 interchange boxes, or
2-input/2-output switches. Let the upper input and output lines of an
interchange box be labeled i, and the lower lines, j. Then the straight
setting connects input i to output i and input j to output j. The
exchange setting connects input i to output j and input j to output i.
A broadcast connects an input to both interchange box outputs. ESC
switching elements are capable of straight and exchange connections and
broadcasts from either input to both outputs.

The connections between stages in the ESC are based on the cube
interconnection functions [18]. Let $P = p_{n-1} \ldots p_1 p_0$ be the
binary representation of an arbitrary I/O line label. Then the n cube
interconnection functions can be defined as

$$\mathrm{cube}_i(p_{n-1} \ldots p_1 p_0) = p_{n-1} \ldots p_{i+1}\,\bar{p}_i\,p_{i-1} \ldots p_1 p_0$$

where $0 \le i < n$, $0 \le P < N$, and $\bar{p}_i$ denotes the
complement of $p_i$. This means that the cube interconnection function
connects P to $\mathrm{cube}_i(P)$, where $\mathrm{cube}_i(P)$ is the
I/O line whose label differs from P in just the ith bit position.
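
As a small illustration of the definition above, complementing bit i of a label is a single exclusive-or with $2^i$. The sketch below is not taken from the paper; the function name and argument order are assumptions made for the example.

```python
def cube(i: int, label: int, n: int) -> int:
    """Return cube_i(label): the n-bit label that differs from `label` only in bit i."""
    if not 0 <= i < n:
        raise ValueError("bit position i must satisfy 0 <= i < n")
    if not 0 <= label < (1 << n):
        raise ValueError("label must satisfy 0 <= label < N = 2**n")
    # XOR with 2^i complements exactly bit i and leaves all other bits unchanged.
    return label ^ (1 << i)
```

For N=8 (n=3), line 5 (binary 101) is paired with line 4 by cube_0 and with line 1 by cube_2. Applying the same cube function twice returns the original label, which is why a single interchange box can serve both lines of a pair.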

Stage i, $0 \le i < n$, of the ESC topology contains the
$\mathrm{cube}_i$ interconnection function, i.e., it pairs I/O lines
whose addresses differ in the ith bit position. It is the only stage
which can map a source to a destination with an address different from
the source address in the ith bit position. When an interchange box in
stage i is set to exchange, the data items input to that interchange
box are transferred as specified by the $\mathrm{cube}_i$
interconnection function. When set to straight, data items input are
transferred according to the identity function, where
$\mathrm{identity}(p_{n-1} \ldots p_0) = p_{n-1} \ldots p_0$. Since
each interchange box is individually controlled, each stage i may
perform the $\mathrm{cube}_i$ interconnection function on all or some
subset of the data items depending on the settings of the interchange
boxes. The extra stage of the ESC, stage n, is placed on the input side
of the network and implements the $\mathrm{cube}_0$ interconnection
function. Thus, there are two stages in the ESC which can perform
$\mathrm{cube}_0$.

Stage n and stage 0 can each be enabled or disabled (bypassed). A stage
is enabled when its interchange boxes are being used to provide
interconnection. It is disabled when its interchange boxes are being
bypassed. Enabling and disabling in stages n and 0 is accomplished with
a demultiplexer at each box input and a multiplexer at each output. All
demultiplexers and multiplexers for stage n share a common control
signal, as do those for stage 0. Fig. 2(a) details an interchange box
from stage n or 0. The demultiplexer and multiplexer are configured
such that they either both connect to their box (enable) or both shunt
it (disable) as shown in Fig. 2(b) and 2(c), respectively.


Fig. 2 (a) Detail of interchange box with multiplexer and demultiplexer
for enabling and disabling. (b) Interchange box enabled. (c)
Interchange box disabled.

Stage n and 0 enabling and disabling is performed by a system control
unit. Normally, the network will be set so that stage n is disabled and
stage 0 is enabled. If after running fault detection and location tests
a fault is found, the ESC is reconfigured. A fault in a stage n box
requires no change in network configuration; stage n remains disabled,
and the fault is isolated. If the fault is in stage 0, stage n is
enabled and stage 0 is disabled. Stage n then performs the function of
the disabled stage 0. For a fault in any link or in a box in stages n-1
to 1, both stages n and 0 will be enabled. Enabling both stages n and 0
provides tolerance to this type of fault by providing two paths between
any source and destination, only one of which can contain the existing
fault.
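
The reconfiguration policy just described amounts to a three-way case split on the fault location. The following is a minimal sketch of that decision, assuming the fault has already been located; the string labels for fault locations and the function name are hypothetical and are not part of the ESC design.

```python
def esc_stage_enables(fault_location):
    """Return (stage_n_enabled, stage_0_enabled) for a located single fault.

    `fault_location` uses hypothetical labels:
      None           no fault found
      "stage_n_box"  faulty interchange box in stage n
      "stage_0_box"  faulty interchange box in stage 0
      "interior"     any faulty link, or a faulty box in stages n-1 to 1
    """
    if fault_location is None or fault_location == "stage_n_box":
        # Normal setting; a stage n fault is isolated by leaving stage n bypassed.
        return (False, True)
    if fault_location == "stage_0_box":
        # Stage n takes over the cube_0 function of the bypassed stage 0.
        return (True, False)
    if fault_location == "interior":
        # Both cube_0 stages in use: two paths, only one of which holds the fault.
        return (True, True)
    raise ValueError(f"unknown fault location: {fault_location!r}")
```

The sketch covers only the single-fault cases listed in the text; handling of multiple simultaneous faults is outside what the paragraph specifies.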

Routing in the ESC is carried out using routing tags [9]. Routing tags
for the ESC, which take full advantage of its fault tolerant
capabilities, can be easily computed. The ESC uses n+1 bit routing tags
where the ith bit position controls stage i. The routing tag for the
fault-free case is given by $T' = t_n t_{n-1} \ldots t_1 t_0$, where
$t_{n-1} \ldots t_1 t_0 = T = S \oplus D$. In the case of faults, bit
positions n and 0 of the tag T' may need to be altered, so actual tag
values depend on whether the ESC has a fault as well as source and
destination addresses, but are readily computed [2]. At each stage i
the switching element examines the ith tag bit. If the bit is a 0, the
switch is set to straight; if it is a 1, it is set to exchange.
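
For the fault-free case, the tag rule can be traced stage by stage. The sketch below assumes stage n is bypassed and that a message passes through stages n-1 down to 0 in that order (stage n being on the input side); the function names are illustrative, and the fault-case tag adjustments from [2] are not modeled.

```python
def routing_tag(source: int, dest: int) -> int:
    """Fault-free tag T = S xor D; bit i controls stage i."""
    return source ^ dest


def trace_route(source: int, dest: int, n: int) -> list:
    """Line labels visited from source to dest through stages n-1 .. 0."""
    size = 1 << n
    if not (0 <= source < size and 0 <= dest < size):
        raise ValueError("labels must lie in 0 .. N-1")
    tag = routing_tag(source, dest)
    label = source
    path = [label]
    for stage in range(n - 1, -1, -1):
        if (tag >> stage) & 1:
            # Tag bit 1: exchange, i.e. apply cube_stage to the current label.
            label ^= 1 << stage
        # Tag bit 0: straight, label unchanged.
        path.append(label)
    return path
```

With N=8, routing from source 2 to destination 5 gives the tag 111, so every stage is set to exchange and the message moves over lines 2, 6, 4, and 5.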

The fault model for the ESC assumes both switching elements and links
can fail. However, the input and output ports and the multiplexers and
demultiplexers directly connected to the ports of the ESC are always
assumed to be functional. If a port or the stage n demultiplexers or
stage 0 multiplexers were to be faulty, then the associated device
would have no access to the network. The fault-tolerance criterion for
the ESC is retention of full access capability [5]. Full access
capability is the ability to connect any given input to any output.
Under its fault model and fault-tolerance criterion the ESC is single
fault tolerant and robust in the face of multiple faults [2].

3.2  Modified  Baseline  Network 

The Modified Baseline network [21] is derived from the Baseline
multistage interconnection network [20]. The Baseline network has but
one path between any source and destination. Thus, any network
component failure will affect communication for some set of inputs and
outputs. To lessen this difficulty, an extra stage of switching
elements is added to the Baseline network. Fig. 3 shows the Modified
Baseline network and indicates the original Baseline network and the
additional stage. The Modified Baseline network is similar to the ESC
except no bypassing of input or output stages is provided. If an extra
stage incorporating switching elements with t outputs is added at the
input side of the network then there are t connection paths between any
I/O pair [21].

Routing in the Baseline network is carried out using destination tags
[9] which consist of the address of the intended destination of a
message. If r×r switches are used (r=2 for Fig. 3) then a destination
address D can be represented by a base-r number
$d_{m-1} \ldots d_1 d_0$ where $m = \log_r N$. This base-r
representation is used to select a path through the network in the
following way. The switching element connected to the source will use
its output numbered $d_{m-1}$ to link to a switching element in the
next stage. At stage i, $d_i$ is used to determine the selection of
switch output, $0 \le i \le m-1$. For the Modified Baseline network an
extra digit can be appended to the defined destination tags to control
the extra stage.

Fig. 3 The Modified Baseline network with N=8.
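
The destination tag is simply the base-r expansion of the destination address, read from the most significant digit down. A short sketch of that expansion follows; the function name is an assumption made for the example, and the extra digit for the Modified Baseline's added stage is not modeled because the text does not specify how it is chosen.

```python
def destination_tag(dest: int, r: int, m: int) -> list:
    """Return [d_{m-1}, ..., d_1, d_0], the base-r digits of `dest`.

    The digit d_i selects the output port of the switch used at stage i,
    so the list is ordered in the sequence the digits are consumed.
    """
    if r < 2 or m < 1:
        raise ValueError("need r >= 2 and m >= 1")
    if not 0 <= dest < r ** m:
        raise ValueError("dest must satisfy 0 <= dest < N = r**m")
    digits = []
    for _ in range(m):
        digits.append(dest % r)  # least significant digit first
        dest //= r
    return digits[::-1]  # reorder so d_{m-1} comes first
```

For 2×2 switches and N=8, destination 5 yields the tag 1, 0, 1. With 4×4 switches and N=16, destination 11 yields the two-digit tag 2, 3.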

The fault model for the Modified Baseline assumes only switching
elements not in the input or output stages fail. Faulty switches are
considered unusable. The fault-tolerance criterion is, as for the ESC,
full access. The Modified Baseline network is single fault tolerant and
robust in the presence of multiple faults with respect to its
fault-tolerance model.

3.3 Augmented Delta Network

The Augmented Delta network [6] is illustrated by Fig. 4.

Fig. 4 An Augmented Delta network.

Fig. 4 shows the original Delta network, labeled D, together with a
stage of N/b b×b switching elements connected to b Delta networks, each
of size N/b. These b networks are labeled $D_0$ through $D_{b-1}$, and
each is structured in the same way. The Delta network has only a single
path between any given input/output pair. The Augmented Delta network
is similar to the Modified Baseline network. The distinction between
the two is that the Augmented Delta network may have more than one
extra stage, and the switches in the extra stages are identical to all
others in the network. Additional redundant paths can be obtained by
adding more stages. So that this network is comparable with the other
networks discussed in this paper, only one stage is assumed to be
added.

The Augmented Delta network is similar to the ESC but may use b×b
switching elements, where the ESC uses 2×2 switches, and thus can have
more paths between a given input and any output. However, like the
Modified Baseline, there is no bypassing of input or output stages.

The Augmented Delta fault model incorporates the assumption that faults
can occur in both switching elements and links. Stage 0 and n switching
elements are always fault free. Faults occur independently, and faulty
links or switching elements are not available for use. The
fault-tolerance criterion is defined as retaining full access
capability. Under such a fault-tolerance model, an Augmented Delta
network constructed from 2×2 switches is single fault tolerant. If b×b
switching elements are used throughout then the network is (b-1)-fault
tolerant.



3.4  F-Network 

The F-network [5] connects $N = 2^n$ inputs to N outputs via n+1 stages
of switching elements which are, in general, 4-input/4-output devices
that connect one input to one output. A switching element in stage j,
$P_j$, is denoted by a bit string $P_j = p_{n-1} \ldots p_1 p_0$. It is
connected to the stage j+1 switching elements

$$P_{j+1} = p_{n-1} \ldots p_1 p_0,$$
$$Q_{j+1} = p_{n-1} \ldots p_{j+1}\,\bar{p}_j\,p_{j-1} \ldots p_1 p_0,$$
$$R_{j+1} = p_{n-1} \ldots \bar{p}_{j+1}\,p_j\,p_{j-1} \ldots p_1 p_0, \text{ and}$$
$$S_{j+1} = p_{n-1} \ldots \bar{p}_{j+1}\,\bar{p}_j\,p_{j-1} \ldots p_1 p_0.$$

Fig. 5 shows the F-network for N=8. Stages are numbered from left to
right ranging from 0 to n and within each stage, switching elements are
numbered from 0 to N-1.

Fig. 5 The F-network for N=8 [5].

The F-network contains the structure of the Generalized Cube network
and can emulate it using only the $P_{j+1}$ and $Q_{j+1}$ connections.
Thus the fault tolerance approach of the F-network is to add links
($R_{j+1}$ and $S_{j+1}$) to the Generalized Cube structure, unlike the
ESC, Modified Baseline, and Augmented Delta networks. Routing in the
F-network is accomplished through the use of routing tags. The
algorithm used to calculate the tags provides for the choice of two of
the four output links at any switching element (except for an output
stage switch). This allows the fault-tolerance capabilities of the
F-network to be realized.

The fault model used for the F-network assumes (1) faults occur only in
switching elements, (2) stage 0 and n switching elements are always
fault-free, (3) faults occur independently, and (4) each fault prevents
the correct execution of any switching element function so a faulty
switching element is totally unavailable.

The F-network is considered to tolerate faults as long as every
input/output pair can communicate. Thus, the fault-tolerance criterion
for the F-network is retention of full access. The network is single
fault tolerant and robust in the presence of multiple faults with
respect to its fault-tolerance model [5].

3.5 Enhanced Inverse Augmented Data Manipulator Network

An Inverse Augmented Data Manipulator (IADM) network is an Augmented
Data Manipulator (ADM) network [16] with the order of stage traversal
reversed. The ADM is derived from the data manipulator network [8].
Fig. 6 shows the IADM for N=8. It consists of $n = \log_2 N$ stages of
N switching elements and 3N links that are connected to the succeeding
stage. Each switching element connects one of three inputs to one of
three outputs. Specifically, at stage i, $0 \le i < n$, the outputs of
switching element j, $0 \le j < N$, are connected to switching elements
$(j - 2^i) \bmod N$, j, and $(j + 2^i) \bmod N$ in stage i+1. These
links are known as the minus, straight, and plus links, respectively.
Since $(j - 2^{n-1})$ is congruent to $(j + 2^{n-1})$ mod N, there are
actually only two distinct logical data paths from each switching
element in stage n-1 (stage 2 in Fig. 6). There is an additional set of
N switching elements at the output stage.

Fig. 6 The Inverse Augmented Data Manipulator network for N=8.
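
The link pattern described above can be written down directly. The sketch below returns the stage i+1 switching elements reached by the minus, straight, and plus links of element j in stage i; the function name and the tuple ordering are assumptions made for the example.

```python
def iadm_successors(j: int, i: int, n: int) -> tuple:
    """Return (minus, straight, plus) targets in stage i+1 for element j of stage i."""
    size = 1 << n
    if not 0 <= i < n:
        raise ValueError("stage i must satisfy 0 <= i < n")
    if not 0 <= j < size:
        raise ValueError("element j must satisfy 0 <= j < N = 2**n")
    step = 1 << i
    # Python's % always yields a non-negative result for a positive modulus.
    return ((j - step) % size, j, (j + step) % size)
```

For N=8, element 0 in stage 0 reaches elements 7, 0, and 1. In stage 2 the minus and plus targets coincide (element 3 reaches 7 by either link), which is the reason only two distinct logical paths leave each stage n-1 element.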

In [11] performance and fault tolerance enhancements of the IADM are
discussed. The fault model for the Enhanced IADM network is the same as
for the Augmented Delta network. The criterion for tolerating a fault
is also the same.

One method of providing fault tolerance with the IADM is adding
redundant straight links. This allows the bypass of a faulty straight
link by using the alternate straight link. Faulty plus or minus links
can be avoided by taking the alternate path available at the stage just
prior to the faulty link [11]. However, switching element faults cannot
be tolerated. Routing for the IADM enhanced with straight links is
exactly the same as for the IADM network and is performed with routing
tags [11].


Fig. 7 The Enhanced Inverse Augmented Data Manipulator network with
half links for N=8.


A second, more effective modification to gain fault tolerance is to add
half links to each of stages 1 through n-1. Half links connect a
switching element m in stage i to switching elements
$(m + 2^{i-1}) \bmod N$ and $(m - 2^{i-1}) \bmod N$. This is shown for
N=8 in Fig. 7. Adding half links provides single fault tolerance to any
switching element or link failure. This is because at any switching
element (except those in stage n-1, the last stage) along a path from a
network input to output there are at least two (sometimes three) links
leading to distinct switching elements in the successive stage, each of
which can be used to satisfy the overall routing need [11]. Again,
routing tags as in the IADM can be used. Significant switching element
logic is required, however, to interpret the routing tags and allow
their dynamic modification to achieve the full fault-tolerance
capabilities of the network. This makes these switches candidates for
VLSI implementation. With a single-stage look-ahead technique the
network becomes two-fault tolerant [11]. That is, messages will not be
sent along a route on which all alternative paths to the next stage are
blocked by the two faults. Further modifications of the hardware
enhancement methods given above are discussed in [11].
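
A half link spans half the distance of the regular plus and minus links of its stage. The sketch below computes the two half-link targets, assuming the offset for stage i is $2^{i-1}$ (the exponent is partly illegible in the source, so this is an interpretation) and that the targets lie in the succeeding stage; the function name is illustrative.

```python
def half_link_targets(m: int, i: int, n: int) -> tuple:
    """Return the (plus-half, minus-half) targets for element m in stage i.

    Half links exist only in stages 1 through n-1; the offset 2**(i-1)
    is an assumed reading of the source text.
    """
    size = 1 << n
    if not 1 <= i <= n - 1:
        raise ValueError("half links are defined for stages 1 .. n-1")
    if not 0 <= m < size:
        raise ValueError("element m must satisfy 0 <= m < N = 2**n")
    half = 1 << (i - 1)
    return ((m + half) % size, (m - half) % size)
```

For N=8, element 0 in stage 1 gains half links to elements 1 and 7, in addition to its regular links to 6, 0, and 2. These extra targets are what give a message a second switching element to fall back on when the preferred one is faulty.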


3.6  Gamma  Network 

The Gamma network [13] is adapted from the IADM network (see Fig. 6)
and has redundant paths connecting $N = 2^n$ inputs to N outputs. It
consists of n stages of N switches. However, unlike the IADM, each of
the switching elements is, in general, a 3-input/3-output crossbar
switch instead of a one-of-three inputs to one-of-three outputs
selector. Switching elements in the input stage have only one input and
three outputs, while output stage switches have three inputs and only
one output. The connection pattern established by the links is
identical to that of the IADM.

In general, the number of paths $P_n$ between an input, or source, S,
and an output, or destination, D, in an n stage network is

$$P_n(x) = \begin{cases} P_{n-1}\left(\frac{x}{2} \bmod \frac{N}{2}\right), & x \text{ even} \\ P_{n-1}\left(\frac{x-1}{2} \bmod \frac{N}{2}\right) + P_{n-1}\left(\frac{x+1}{2} \bmod \frac{N}{2}\right), & x \text{ odd} \end{cases}$$

where $x = (D - S) \bmod N$, $P_1(0) = 1$, and $P_1(1) = 2$. Note that
$P_n(0) = 1$ for all n. The fact that $P_n(x) > 1$ for $x \ne 0$ is the
source of fault tolerance in the Gamma network.
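
The path count can be cross-checked by brute force. The sketch below assumes the recurrence reduces its arguments modulo N/2 (the size of the (n-1)-stage network) and that a path is a choice of one of the offsets $-2^i$, 0, $+2^i$ at each stage i whose sum is congruent to x mod N; both functions and their names are illustrative rather than taken from [13].

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def gamma_paths(n: int, x: int) -> int:
    """Number of source-to-destination paths for distance x in an n-stage network."""
    if n < 1:
        raise ValueError("n must be at least 1")
    size = 1 << n
    x %= size
    if n == 1:
        # Base cases from the text: P_1(0) = 1 and P_1(1) = 2.
        return 1 if x == 0 else 2
    half = size >> 1
    if x % 2 == 0:
        return gamma_paths(n - 1, (x // 2) % half)
    return (gamma_paths(n - 1, ((x - 1) // 2) % half)
            + gamma_paths(n - 1, ((x + 1) // 2) % half))


def gamma_paths_brute(n: int, x: int) -> int:
    """Count offset sequences (one of -2^i, 0, +2^i per stage) summing to x mod N."""
    size = 1 << n
    count = 0
    for digits in product((-1, 0, 1), repeat=n):
        offset = sum(d * (1 << i) for i, d in enumerate(digits))
        if offset % size == x % size:
            count += 1
    return count
```

For N=8 the distance x=7 has four paths, while x=0 has exactly one for every n, matching the remark that only the connection to the identically numbered output lacks redundancy.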

The Gamma network can be controlled by n digit routing tags, the value
of which is the difference mod N between the numbers of the network
input and output to be connected. The digits of the tag may be 1, 0, or
-1, corresponding to the $+2^i$, straight, and $-2^i$ links,
respectively. Control of the Gamma network when faults occur is not
explicitly specified in [13].

A fault model that can be used for the Gamma network assumes (1) faults
occur only in switching elements, (2) the input and output stage
switching elements are always fault-free, (3) faults occur
independently, and (4) faulty switching elements are not available to
pass information. The fault-tolerance criterion appropriate for the
Gamma network is full access without the stipulation that an input be
able to connect to the identically numbered output, as there is only
one way to perform this connection. Under this fault-tolerance model
the network is single fault tolerant.

3.7 Fault-Tolerant Benes Network

The Benes network [4] connects $N = 2^n$ inputs to N outputs via 2n-1
stages each with N/2 2-input/2-output switching elements. The switching
elements can be set to one of two states: straight or exchange. The
Benes network is rearrangeable in that any idle input/output pair can
be connected by rerouting any established one-to-one connections if
necessary. In other words, any one-to-one connection can be established
regardless of any existing connections. Thus, the Benes network can
perform any permutation of inputs to outputs.

Control of the Fault-Tolerant Benes network is performed by computing
the necessary settings of all the switching elements, and then imposing
that state on the network through control lines, one per switch. The
computation requires time proportional to the number of switches,
provided faulty switches are not avoided [17]. The computed switch
settings can be adjusted to match the state of a stuck switch. Faulty
switches must be used if permutations are to be performed in only one
pass through the network.

The fault model used in [17] for the analysis of fault tolerance in the
Benes network is a switching element stuck-at fault model, that is, a
switching element can be stuck in the straight setting or stuck in the
exchange setting. Faults are assumed to occur only in the switching
elements of the network, to occur independently, and to be hard.
Faulty switching elements are allowed to pass data in accordance with
the state of the stuck switch. This is a relatively weak fault model in
that it supposes


an optimistic view of hardware behavior. For example, other switching
element failure modes may well be possible, such as ones where
continued use of the switching element is not possible. Link failures
may also occur in a physical network.

The Benes network can tolerate most single faults, as defined by the
above model, where the fault-tolerance criterion is retaining the
ability to perform any permutation connection in a single pass
through the network. This is also known as full connection
capability [17]. It is the most stringent fault-tolerance criterion
of the networks considered, but the Benes network is the most
capable of all the networks considered in terms of permuting
capability. Some multiple switching element faults not in the center
stage can be tolerated as well, so the network is robust. However,
if any single switching element in the center stage is stuck at the
exchange setting then the identity permutation, which connects each
input to the identically numbered output, cannot be performed. Also,
if any center stage switching element is stuck at the straight
setting then the uniform shift connecting each input i, 0 ≤ i < N,
to output (i + N/2) mod N is not possible.
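
The two center-stage cases above can be checked by brute force on a
small instance. The following is a minimal sketch, assuming the
standard 4x4 Benes topology (three stages of two 2x2 switching
elements) rather than the N=8 network of the figure; the function
names and the box numbering (0, 1 input stage; 2, 3 center stage;
4, 5 output stage) are illustrative assumptions, not taken from [17].

```python
from itertools import permutations, product

STRAIGHT, EXCHANGE = 0, 1

def box(upper, lower, setting):
    """2x2 switching element: pass inputs straight through or swap them."""
    return (upper, lower) if setting == STRAIGHT else (lower, upper)

def benes4(settings):
    """Permutation performed by a 4x4 Benes network.

    settings = (in0, in1, mid0, mid1, out0, out1); the result maps each
    input number to the output number it reaches.
    """
    in0, in1, mid0, mid1, out0, out1 = settings
    a0, a1 = box(0, 1, in0)        # input stage
    a2, a3 = box(2, 3, in1)
    b0, b1 = box(a0, a2, mid0)     # upper center box gets the upper outputs
    b2, b3 = box(a1, a3, mid1)     # lower center box gets the lower outputs
    c0, c1 = box(b0, b2, out0)     # output stage
    c2, c3 = box(b1, b3, out1)
    arrived = (c0, c1, c2, c3)     # arrived[k] = input delivered to output k
    perm = [0] * 4
    for output, source in enumerate(arrived):
        perm[source] = output
    return tuple(perm)

def realizable(target, stuck=None):
    """True if some setting consistent with the stuck boxes performs target.

    stuck maps a box index (0..5) to the setting it is stuck at.
    """
    stuck = stuck or {}
    for settings in product((STRAIGHT, EXCHANGE), repeat=6):
        if all(settings[i] == s for i, s in stuck.items()):
            if benes4(settings) == tuple(target):
                return True
    return False

N = 4
identity = tuple(range(N))
shift = tuple((i + N // 2) % N for i in range(N))

print(realizable(identity, {2: EXCHANGE}))   # False: center box stuck at exchange
print(realizable(shift, {2: STRAIGHT}))      # False: center box stuck at straight
# An input-stage box stuck at exchange still leaves every permutation possible.
print(all(realizable(p, {0: EXCHANGE}) for p in permutations(range(N))))  # True
```

Exhaustive search is only practical for very small N; it is used here
to illustrate the claim, not as a control algorithm.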

Any center stage fault can be corrected by a modification that
involves adding a single switching element at the input or output
stage [17]. The Benes network without modification can tolerate a
switching element stuck-at fault at all but the center stage. The
addition of the single switching element overcomes this difficulty.
The configuration of the fault-tolerant network with the extra
switching element at the output is shown in Fig. 8 for N=8.
Tolerance of a fault is achieved by using the extra switching
element to correct

Fig. tt A simple single-stage β-network.

sidered in this paper. This provides a more complete view of the
state of the art in fault-tolerant multistage networks.

A β-network is defined as having the dynamic full-access property if
each network input can be connected to each network output in a
finite number of passes through the network. Between passes it is
assumed that each output can connect to its corresponding input
(i.e., the input with the same number as the output) via a path
outside the network. The β-network is said to tolerate a fault if
the fault does not destroy dynamic full-access capability. This is a
considerably less restrictive fault-tolerance criterion than is used
in any of the other networks surveyed. The purpose in using the
dynamic full-access measure is to better characterize the
connectivity requirements of computer systems than either
full-access or rearrangeability (full connection) capability [14].
However, the multiple pass method of network operation implied by
the dynamic full-access criterion may be unsuited for some, if not
many, applications.
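
The dynamic full-access property amounts to a reachability test.
The sketch below assumes the network is summarized by a hypothetical
table `one_pass`, where `one_pass[i]` is the set of outputs that
input i can reach in a single pass; how that table is obtained from
the switch settings and faults is not modeled here.

```python
def has_dynamic_full_access(one_pass):
    """Check that every input reaches every output in finitely many passes.

    one_pass[i] is the set of outputs input i can reach in a single pass.
    Between passes, output k is fed back to input k, so multi-pass
    reachability is the transitive closure of one_pass.
    """
    n = len(one_pass)
    for source in range(n):
        reached = set()
        frontier = {source}            # inputs from which a pass may start
        while frontier:
            new_outputs = set()
            for i in frontier:
                new_outputs |= one_pass[i] - reached
            reached |= new_outputs
            frontier = new_outputs     # each new output re-enters as an input
        if len(reached) < n:
            return False
    return True

# Hypothetical 4-input network in which each input reaches only one output.
fault_free = [{1}, {2}, {3}, {0}]
faulty = [{1}, {2}, set(), {0}]        # a fault disconnects input 2

print(has_dynamic_full_access(fault_free))   # True
print(has_dynamic_full_access(faulty))       # False
```

In the fault-free example some connections need up to four passes,
which illustrates why transit time can vary widely under this
criterion.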

The fault model used for β-networks is the same as that assumed for
the Fault-Tolerant Benes network. Thus, fault tolerance in a
β-network is considered to be retention of dynamic full access using
β-elements even with stuck-at faults.

There are two important disadvantages to the β-network approach to
fault-tolerant networks. One is the computational complexity of
using the dynamic full-access criterion. Even when faults have been
detected and located, considerable work remains to determine the
operational status of the network. Specifically, the set of located
faults must be tested to see if it comprises a critical fault, one
which destroys dynamic full access. The second disadvantage is that
by allowing a finite number of passes through the network, data
transit time becomes widely variable. This will impose burdens on an
SIMD [7] system attempting to maintain synchronization.

Routing in a β-network can be accomplished using binary routing tags
with as many bit positions as there are stages in the network.
However, β-networks constitute such a broad class that there is no
one routing tag scheme generally applicable. Also, realization of
dynamic full-access capability may incur significant computational
expense for routing tags, since a set of tags leading from the
original source via a finite number of passes through the network to
the ultimate destination must be generated.
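
The multi-pass part of this routing problem can be sketched as a
shortest-path search. The example assumes the same hypothetical
`one_pass` table as a single-pass reachability summary and returns
the output reached on each pass; generating the per-stage routing tag
for each individual pass depends on the particular β-network and is
not modeled.

```python
from collections import deque

def multipass_route(one_pass, src, dst):
    """Return the outputs reached on successive passes, ending at dst.

    one_pass[i] is the set of outputs input i can reach in one pass.
    Output k re-enters the network at input k for the next pass.
    Returns None if dst cannot be reached from src.
    """
    parent = {src: None}               # input -> input of the previous pass
    queue = deque([src])
    while queue:
        i = queue.popleft()
        for o in sorted(one_pass[i]):
            if o == dst:
                hops = [dst]
                while parent[i] is not None:
                    hops.append(i)     # i was an intermediate output
                    i = parent[i]
                return hops[::-1]
            if o not in parent:
                parent[o] = i
                queue.append(o)
    return None

ring = [{1}, {2}, {3}, {0}]            # hypothetical single-pass table
print(multipass_route(ring, 0, 3))     # [1, 2, 3]
print(multipass_route(ring, 0, 0))     # [1, 2, 3, 0]
```

Breadth-first search gives a route with the fewest passes, but one
tag per pass still has to be produced for every source/destination
pair, which is the expense noted above.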


3.9 Summary of Network Survey

Table 1 summarizes the network fault tolerance information
presented. It gives the possible faults that can occur in each
network under the assumed fault model, the fault-tolerance
criterion, the method by which the network copes with faults,
whether the network is single fault tolerant, and how it performs
when there are multiple faults. Note that in the table the phrase
"internal node faults only" is another way of saying input and
output switching elements are always fault-free.

4.  Network  Evaluation 

There is a growing literature on fault-tolerant multistage
interconnection networks. However, as pointed out in [10], many of
the results to date have several limitations, including (1)
unreasonably optimistic fault models, and (2) increased data routing
complexity. As noted earlier, the choice of fault model and
fault-tolerance criterion plays a key role in determining the fault
tolerance characteristics of a network. In this section the ESC is
compared with the other networks surveyed. Table 2 summarizes that
comparison. The facts and reasoning supporting Table 2 are discussed
below.

ESC fault tolerance is evaluated in light of a fault model that
presupposes the possibility of failure of any network component
except the stage n demultiplexers and stage 0 multiplexers, which
are treated as part of the network input/output interface. Stage n
multiplexer and stage 0 demultiplexer failures are treated as stage
n and stage 1 link failures, respectively. As can be seen from
Tables 1 and 2, this fault model is stricter than the fault models
of the comparison networks. That is, it assumes at least as many
possibilities for failure as the other models (both switching
elements and links) and dire consequences for such failures (any
faulty component is unusable). The ESC fault model may well be the
most realistic of these fault models.

The fault-tolerance criterion for the ESC is the same as that for
most of the networks surveyed. Basically, what is required is that
one-to-one interconnection capability be uncompromised. The
Fault-Tolerant Benes network uses the more demanding criterion that
permuting capability be unaffected: all permutations should still be
performable with a single pass through the network. It is
appropriate to use this strict criterion because the Fault-Tolerant
Benes network, unlike the other networks considered, has full
connection capability.

The fault-tolerance criterion used to study β-networks is a much
less strict test to pass. All that is required is that it be
possible to connect any input to any output in a finite number of
passes through the network. Successive passes are performed by
returning data from a network output to the same numbered input. In
a fault-free condition a β-network may require multiple passes for
data to reach its destination, so the chosen fault-tolerance
criterion is appropriate. However, since the class of β-networks is
so broad, it is important to note that this forgiving criterion may
inflate the capabilities attributed to more complex β-networks. The
fault-tolerance criterion of the Gamma network may be unsuitable for
some computer systems, as it does not consider inability to connect
an input to the identically numbered output (an identity connection)
to be a failure. If the same device, e.g., a processor-memory pair,
is connected to the same numbered input and output then this is not
a problem, since a device should not need to communicate with
itself.

For most of the networks, routing in the presence of faults is
little more complex than in the absence of faults. The notable
exception to this is β-networks. The dynamic full-access procedure
requires choosing a set of intermediate outputs which can each be
reached consecutively, such that the ultimate destination can be
reached in one pass from the input with the same number (address) as
the last intermediate output. A general solution to this problem is
not known. Routing complexity for the Fault-Tolerant Benes network
is higher than for the ESC because of the nature of the Benes
network [1-|. It is not due to the modification for fault tolerance.

5. Comparison to the Extra Stage Cube

The fault tolerance capabilities of the networks considered are all
reasonably similar given the various bases by which they are
determined. This is apparent from the column on fault tolerance
capabilities in Table 2. There should be no surprise that this is
so. It is easy to agree with the idea that a network should have
whatever fault tolerance capabilities are feasible, and single fault
tolerance is more feasible than i-fault tolerance, i > 1. However,
because each network is studied using its own fault-tolerance model,
significant differences in capabilities might appear if a common
fault model is adopted.

The ESC fault model and fault-tolerance criterion can be applied to
the other surveyed networks in order to relate their fault tolerance
to that of the ESC. This information is given in the first column of
Table 2. Under the ESC fault model and fault-tolerance criterion
none of the surveyed networks is single fault tolerant. Many of the
networks fail to be single fault tolerant because they cannot
tolerate an input or output switching element fault, as can the ESC.
This is why so many of the fault models refer only to internal
switching element faults. If the ESC fault model is amended to
assume fault-free switching elements in the input and output stages,
some of the networks become single fault tolerant, as shown in the
table.

The Fault-Tolerant Benes network is capable of single fault tolerant
operation under the relaxed ESC fault model. Although faulty
components cannot be used to pass data under the ESC fault model
(unlike the Fault-Tolerant Benes fault model), only one-to-one
connections need be supported (as compared to permutation
connections for the Fault-Tolerant Benes fault model). The
Fault-Tolerant Benes network can perform any one-to-one connection
without using a given faulty component. However, the control method
given in [17] must be modified to achieve this fault tolerance
capability so that faulty network components are avoided (the given
algorithm uses faulty components).

The Enhanced IADM with redundant straight links is not single fault
tolerant when the ESC fault model is relaxed because it still cannot
tolerate all switching element failures. This includes the switching
element failures in interior stages allowed under the modified fault
model. The additional straight links provide fault tolerance against
the loss of a straight link, but not a switch. The fault-tolerance
capability with respect to switches is the same as the IADM, and
there are many cases where a switch failure will block a connection
[11]. The Gamma network is not single fault tolerant under the
relaxed ESC fault-tolerance model because it has only one path from
an input to the identically numbered output. A straight-link fault
will prevent an input from communicating with the identically
numbered output (as it would in the IADM network on which the Gamma
network is based). Thus, the Gamma network does not satisfy the ESC
fault-tolerance criterion of full access.
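
Applying a common fault model in this way reduces to a path-coverage
check. The sketch below assumes each source/destination pair is
described by a list of alternative paths, each given as the set of
components it uses; the component names and the two small examples
are hypothetical and do not reproduce any of the surveyed topologies.

```python
def tolerates_single_faults(paths, components):
    """Check one-to-one connection capability under every single fault.

    paths[(src, dst)] lists the alternative paths for that pair; each
    path is the set of components (links or switches) it uses. A faulty
    component is unusable, so a pair survives a fault only if at least
    one of its paths avoids the faulty component.
    """
    for faulty in components:
        for alternatives in paths.values():
            if all(faulty in path for path in alternatives):
                return False
    return True

# Hypothetical pair served by two component-disjoint paths.
redundant = {(0, 1): [{"sw_a", "link_a"}, {"sw_b", "link_b"}]}

# Hypothetical network where the identity connection has a single path.
single_path = {
    (0, 0): [{"straight_link"}],
    (0, 1): [{"sw_a"}, {"sw_b"}],
}

print(tolerates_single_faults(redundant, ["sw_a", "link_a", "sw_b", "link_b"]))  # True
print(tolerates_single_faults(single_path, ["straight_link", "sw_a", "sw_b"]))   # False
```

Restricting `components` to interior elements corresponds to the
relaxed fault model in which input and output stage switching
elements are assumed fault-free.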

6.  Conclusions 

Eight fault-tolerant interconnection networks have been described.
All have multiple stages, but there are wide variations in topology
and switching element design. The Fault-Tolerant Benes network and
β-networks are both composed of β-elements, but have differing link
patterns. The Gamma network uses 3x3 crossbar switching elements and
the same link connection pattern as the IADM. The Enhanced IADM and
F-network use 5x5 and 4x4 switches, respectively, which pass one
item at a time. The Augmented Delta, Modified Baseline, and ESC
networks are all cube-type networks [10], and each incorporates an
extra stage of switching elements to provide redundant paths. The
ESC provides for bypassing of faulty input and output stage
switching elements.

ABSTRACT

A system of special purpose processing resources shared by a number
of general purpose processing resources is considered. We assume
that special purpose processing resources are dedicated to different
tasks typical of complex artificial intelligence multitask jobs.
Possible types of special purpose processing resources include
pipelined array processors, SIMD parallel processor systems, or MIMD
multiprocessor systems, with associated data bases or knowledge
bases, for numeric or symbolic computing. Each specific type may be
represented by several units. Such a structure may be found in the
large local area networks of the 1990s which are used predominantly
for artificial intelligence, or in high-end computers of the 5th
generation. Given such a processing environment, in this paper an
approach for efficient distributed task allocation is introduced. It
is referred to as the LOCO approach, because an analogy with a
locomotive engine (and appended wagons) is used to describe it. An
analytic model of the LOCO approach is developed and used for
performance analysis. Results of the performance analysis are
presented comparatively with those of load balancing applied to the
same processing environment. Although our primary concern is a
processing environment for artificial intelligence, we find that the
LOCO approach can be used efficiently in other types of processing
environments, as well.

KEYWORDS: Distributed Task Allocation, Loosely Coupled
Multiprocessor Systems, Artificial Intelligence Oriented Systems.

I. INTRODUCTION

Conventional general purpose (GP) computers are not able to meet the
complex computational requirements typical of artificial
intelligence (AI). However, the overall processing power of the
conventional GP computers can be considerably enhanced if
appropriate special purpose processing resources (SPPRs) are
attached to them. In that case, the GP computers serve as the hosts
and the SPPRs as their computational enhancements. In such a
processing environment, the SPPRs are oriented to various specialist
tasks typical of complex multitask AI jobs. These tasks may be
signal processing [e.g., RabCo7S], natural language
processing/understanding [e.g., Grosz82], vision
processing/understanding [e.g., Brady83], intelligent retrieval from
knowledge bases within the expert system [e.g., HaWaL83], etc.
Internally, the SPPRs may be organized as special function
processors, systolic arrays, pipelined array processors, SIMD
machines [Flynn72], or MIMD machines [Flynn72]. Each SPPR will have
an associated data base or knowledge base, and will be oriented to
numeric or symbolic processing. We have many examples of internal
organizations oriented to dedicated control [e.g., MilWa83,
MiluC83], signal processing [e.g., McOMa82], speech
processing/understanding [e.g., Lerae80], image
processing/understanding [e.g., SiSiKSI], efficient retrieval from
relational data bases [e.g., MaKaM83], combinatorial search
[WahMa84], inference [Usbid83, SaHoS8l], etc. A single SPPR may
consist of a large number of processing elements (PEs). As an
example, the MPP processor for image processing [Bauh80] includes
2^14 PEs. The DADO production system is supposed to include on the
order of a hundred thousand PEs [StoSh82]. A large number of PEs is
used to speed up a special-purpose computation, and not to acquire
general processing power.

*J. Crnkovic is now with the Department of Mathematics and Computer
Science, University of Miami, Coral Gables, Florida 33124.

As the application requirements are getting more and more complex,
the overall computational capabilities can be further increased by
adding more SPPRs to a GP host. Several experiments of this type
were reported [MarBrdl]. This trend is very likely to continue,
especially given the great importance and massive computational
requirements typical of the general area of AI. Consequently we
expect that in the 1990s systems oriented to AI will consist of
hundreds of SPPRs. As the SPPRs will still be costly, it will make
sense to share them among a number of GP hosts. It is very unlikely
that all SPPRs in such a system will be of the same type, and it is
even more unlikely that each one will be different from the others.
We expect that they will be of a variety of different types where
each type is represented by an appropriate number of units.
Different SPPRs will be of different levels of specialization. It is
reasonable to expect that the whole system of GP hosts and SPPRs
will span an area the size of a typical university campus or
military base (up to about 1 mile in radius).

A similar type of processing environment may also be found in
high-end computers of the 5th generation. Input/output (in the wide
sense) will include natural language processing and computer vision.
Memory (in the wide sense) will incorporate a variety of data bases
and knowledge bases. The processor (in the wide sense) will
incorporate the capabilities for both information and knowledge
processing [TreLi82].

In both cases it will be extremely important to have an efficient
mechanism for the dynamic allocation of different tasks belonging to
complex AI jobs [Dsvis83]. For a number of reasons, this mechanism
must be distributed in nature. It might exist as distinct and
identifiable blocks of code, or only as a design philosophy
[Easlo78, leaPI84]. The complexity of the problem is higher than
what may initially be expected, as most of the hosts may be working
in a multiprogramming environment, where different processes running
on the same host will have jobs with tasks oriented to different
SPPRs. Also, the allocation requirements will change in time.
Obviously, the solution of the problem should involve the following
two basic aspects:

a) System architecture that supports an efficient task allocation.

b) Dynamic task allocation procedure which is distributed in nature.

With all that in mind, in this paper a system architecture is
considered which consists of GP hosts and logically clustered SPPRs,
all connected by a shared multiple access bus, possibly but not
necessarily of the CSMA/CD type [e.g., SbOaR82]. Such a structure is
well suited to the execution of complex multitask jobs typical of AI
and will be referred to as AIDA (Artificial Intelligence Directed
Architecture). For this type of system architecture an efficient
approach to dynamic and distributed task allocation is introduced
and analyzed. It is referred to as the LOCO approach, because an
analogy with a locomotive engine (and appended wagons) is used to
describe it [MilSi84]. As will be seen later, this approach is quite
general in nature, and can be applied to processing environments
other than the one described here.

It should be noted that the emphasis here is on a system that
consists of a large number of heterogeneous classes of SPPRs, made
accessible to a very large number of users through a large number of
hosts. Furthermore, due to the variety of SPPRs, the types of
computations to be performed are unlimited: the system is not
restricted to any single task domain. This makes it very appropriate
for environments, such as AI, that require many different types of
tasks to be executed. This can be contrasted to a parallel
processing system like PASM [SiSiKSI] in the following ways: (1)
PASM's computation engine consists of a set of homogeneous
processors, (2) the PASM processors are interconnected by a
multistage network [SiegeS4], rather than a network of shared
busses, (3) when operating in SIMD mode, the PASM processors exploit
instruction level parallelism, while the inter-SPPR parallelism is
on the task level, (4) PASM is intended for image understanding
[KuSiA84], whereas LOCO/AIDA is much more "general purpose" in
nature, (5) PASM is intended to support a much smaller set of users
than LOCO/AIDA, and (6) PASM could be an SPPR in LOCO/AIDA.

A study of distributed resource sharing is given in [W ihS4].
According to it, AIDA can be treated as a resource sharing network
based on a single shared bus, and LOCO as a procedure with an
addressing mechanism distributed in the network.
ihp  [.Ot'O/AIO  \  •‘Hv ironm^nt  is  rharacf^rtzrd  by  a  number  of 
<'i'':'rif*nf s  typical  of  iht*  iaia  v.ronnt^nt.  ^CiPaK^*|. 

This paper is organized into six sections. Assumptions of the
analysis are addressed in Section II. The system architecture (AIDA)
is introduced in Section III. The distributed procedure for dynamic
task allocation (LOCO) is introduced in Section IV, first through an
example and then generalized. A model of the LOCO approach based on
the AIDA architecture is introduced in Section V. For comparison
purposes, in the same section, a model of load balancing (LB)
applied to the same system architecture is introduced, as well.
Performance analysis of the LOCO approach and its comparison to the
LB approach are given in Section VI.

II. BASIC ASSUMPTIONS OF THE ANALYSIS
The presentation to follow will be based on the following
assumptions:

1. A task is a complex of computation that can be performed by an
SPPR without any intermediate interaction by the host. In other
words, once an SPPR is loaded with the program, its parameters, and
the data, it can autonomously execute the task until its completion.
The tasks are highly specialized. Consequently, a great variety of
different SPPR types is necessary. The number of different SPPR
types is assumed to be much greater than 1.

2. Each task is a part of the job (J) that belongs to a process (P)
running on one of the hosts (H). A number of processes can run
concurrently on the same host. A single process may include a number
of jobs (running sequentially or concurrently). A single job
consists of a number of different tasks (running sequentially or
concurrently). This can be schematically represented as in the
example of Fig. 1.


Figure 1. Example of a Complex Multitask AI Job.

In the example of Fig. 1, TASK 1 and TASK 2 can run concurrently on
two different SPPRs, while a later task must wait for TASK 1 to be
completed, since it needs its output data. The same task may exist
in various contexts. As the number of jobs can be very large, it may
help if each SPPR type is represented by more than one unit. The
number of units per SPPR type is assumed to be much greater than 1.

4. The total number of SPPRs in the system is given by the number of
SPPR types and the number of units per type. We anticipate that in
real systems of the 1990s this number may get to be on the order of
several hundreds. The number of hosts, as job generators, is assumed
to be of the same order of magnitude. These facts justify the use of
an infinite population model in the analysis to follow.

As indicated earlier, the SPPRs may be more or less specialized. A
specific task can run on one SPPR type only, or on one of a number
of different SPPR types. In the second case, however, one of the
SPPR types will be the most suitable. In the presentation to follow
a task will always be associated with one SPPR type, regardless of
whether it is the only possibility or the most suitable possibility.
This assumption simplifies the presentation without affecting its
generality.

Unless otherwise stated we assume that each SPPR type is represented
by the same number of units. This will simplify the notation without
affecting the generality.

5. The duration of a task execution is considerably longer than the
time needed to transfer data to the SPPR that will execute the task.
The transfer time includes the time to access the communications
medium and to exchange control data. This assumption is quite
realistic. On one hand, advanced fiber optic technology is enabling
local area communications to reach gigabit/second speeds [e.g.,
PoCoS83]. On the other hand, the execution time of tasks may be
extremely high. This is due to extensive data quantities (tasks with
very large numbers of pixels and/or logic rules are not uncommon)
and extensive computational intensity (for both numeric processing
and logic search). As known from previous work [e.g., M ihSl], if
the task transmission time is small compared to the task service
time, the single bus approach is the best approach. This was the
justification for us to concentrate in our research on the single
bus system architecture for support of the task allocation.

6. Each task in execution can be treated as a secondary process
(running on the SPPR) that can generate a number of secondary jobs,
each one consisting of a number of secondary tasks. The nesting can
continue as necessary. We treat this issue as VERtically Distributed
Intertasking, or simply VERDI. Consequently, the system architecture
is referred to as the AIDA by VERDI. We mention this nesting as an
interesting property of the LOCO approach. However, that issue will
not be further analyzed in this work.

7. AI tasks (both those predominantly oriented to numeric and
symbolic processing) are characterized by large execution time
variations [Brady83, Grosz82]. The same conclusion has been derived
by a recent study [RobefH]. Consequently, the correlation between
past experience on execution time for a given type of task and its
future execution time is low. This fact represents one of the
essential differences between typical AI tasks and the tasks typical
of the conventional GP processing environment. It is of crucial
importance for the analysis to follow. As will be seen later, this
fact has a major influence on our choice of the task allocation
procedure, and the underlying system architecture.

8. Programming of the SPPRs is very complex. Although the user can
develop their own software, typically parametric library routines
are used. The user's major effort is to specify the software
parameters. The library routines are assumed to be relatively short,
as they control specialized processing resources. These facts
influence the choice of the system architecture for the efficient
support of task allocation.

9. The area over which the system is spanned, as well as the
bandwidth of the communications medium, ensure that the propagation
time between any two points in the system is negligible in
comparison with the duration of the shortest possible message. We
assume that enough bandwidth is available, so that computation and
not communication is the system bottleneck. This assumption permits
us to neglect the media access and handshaking effects in the
analysis to follow.

10. This paper concentrates on the case when SPPRs are the
bottleneck in the system. The case when both SPPRs and the
interconnection network are the bottleneck in the system is not
considered as a part of this work.

Unless otherwise stated, all these assumptions will be used
throughout the presentation to follow.

III.  SYSTEM ARCHITECTURE

One possible approach to a system architecture for the processing
environment treated here implies connecting the SPPRs to the back-ends
of the hosts and connecting the hosts into a single shared bus
network, as indicated in Fig. 2a. This approach permits load balancing
[e.g., FiwT T< tSl'*], but reliability and expandability may be
problematic. We follow here another approach, according to which the
SPPRs are moved into the front-end and share the same bus with the
hosts, as indicated in Fig. 2b. This approach has good reliability and
expandability. It supports load balancing, and several good papers
exist on that topic [e.g., WahJu83, WthFliSJ}]. Load balancing is very
efficient if execution times of the tasks waiting for processing can be
precisely estimated. This estimation may not be possible or may require
intensive computation [ir.»nT 'S(i]. So allocation typically relies on
past experience about the execution time for a given type of task. If
the correlation between past and future execution time is relatively
high, load balancing achieves a very good performance [e.g.,
.\iHwvil, t'hoKorol]. Unfortunately, this assumption is not satisfied
in our case and we cannot use the existing results. We were forced to
search for an appropriate task allocation procedure and a system
architecture that are [illegible] task execution.
Figure 2.  Some Possible System Architectures for Distributed
Processing on a LAN (Local Area Network).

(a) The SPPRs in the back-end of the hosts, with load
balancing.

(b) The SPPRs directly appended to the LAN, with
load balancing.

(c) Logical clustering of physically remote SPPRs.
The SPPRs are assumed to be physically remote.
However, they are connected by a high-speed link,
and they behave as if they were locally clustered,
i.e., logically clustered.

Fortunately, the existence of a high-speed communications
medium gives an important new dimension to distributed processing. It
enables the introduction of the concept of logical clustering of
physically remote SPPRs of the same type. Consequently, given a fast
enough communications medium and a small enough local area, it
makes sense to interconnect all SPPRs of one type into a single logical
cluster, as indicated in Fig. 2c. Although physically remote, these
SPPRs behave as if they were logically local to each other.
Consequently, we have N_c logical clusters, with N_q SPPRs of the same
type in each.

Logical clustering means that all SPPRs of the same type are
treated as a single multiple-server service station. No load balancing
is needed any more, as it implies an environment characterized by
multiple single-server service stations. Also, the lack of correlation
between past and future is of no importance any more. The task is
simply sent to a logical cluster that consists of all SPPRs which are
best suited to its efficient execution. It waits in the queue
associated with the logical cluster until all the previously arrived
tasks (in the case of FIFO disciplines) or higher priority tasks (in
the case of priority disciplines) have been served. Note that the
concept of clusters in our approach is considerably different from the
concept of clusters in Cm* [SwFuS77], Ultracomputer [GoGrK83], or
Cedar [GaLaK83].
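
The dispatching rule described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the class
name `LogicalCluster`, its methods, and the use of plain strings as
tasks are hypothetical and do not come from the paper, and a FIFO
discipline is assumed.

```python
from collections import deque


class LogicalCluster:
    """Hypothetical model: one FIFO queue shared by all SPPRs of one type."""

    def __init__(self, cluster_id, sppr_ids):
        self.cluster_id = cluster_id
        self.chain = list(sppr_ids)   # fixed daisy-chain order of the SPPRs
        self.idle = list(sppr_ids)    # currently idle SPPRs, in chain order
        self.busy = {}                # sppr_id -> task being executed
        self.queue = deque()          # tasks waiting for any SPPR of this type

    def submit(self, task):
        """Send a task to the cluster.

        Returns the SPPR that took the task, or None if the task must wait.
        """
        if self.idle:
            sppr = self.idle.pop(0)   # first idle SPPR in the chain
            self.busy[sppr] = task
            return sppr
        self.queue.append(task)
        return None

    def complete(self, sppr):
        """Free an SPPR; it takes the oldest waiting task, if any, and returns it."""
        del self.busy[sppr]
        if self.queue:
            task = self.queue.popleft()
            self.busy[sppr] = task
            return task
        self.idle.append(sppr)
        self.idle.sort(key=self.chain.index)  # restore chain order
        return None
```

Because every SPPR of the cluster serves the same queue, no estimate of
task execution time is needed to decide where a task goes.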

Now we will describe the Artificial Intelligence Directed
Architecture (AIDA), which is based on the above described principle of
logical clustering. It is given in Fig. 3. The system consists of N_h
hosts and N_s SPPRs organized into N_c clusters with N_q SPPRs per
cluster. Each cluster is associated with a mass storage unit M(j);
j = 1, ..., N_c. This is where the software (parametric library
routines) for all SPPRs in that cluster is stored. Knowledge bases and
data bases can exist within SPPRs, in the mass storage units associated
with the cluster, or in any other suitable form. The SPPRs are
interconnected by a system of buses. Separate buses are used for task
allocation, for data and parameter transfer, and for the transfer of
the library routines. These buses will be referred to as the
allocation, data, and software bus, respectively. The allocation bus is
a single-line bus (bit transfer). It connects the hosts, the SPPRs, and
the mass storage units, as the software libraries have to be updated
occasionally. It includes the cluster branches. Each cluster branch is
separated into the INPUT_BRANCH and the OUTPUT_BRANCH. The
INPUT_BRANCH is daisy chained, as indicated in Fig. 3. Given
assumption #8, the software bus can also be a single-line bus (bit
transfer). We assume one software bus per logical cluster. The
software buses connect the SPPRs of the cluster with the corresponding
mass storage unit. Given assumption #5, the data bus should be a
multiple-line bus (word transfer). It connects the hosts and the SPPRs.
An identification number (ID) is assigned to each host (H_ID), process
(P_ID), job (J_ID), and task (T_ID). Identification numbers are also
associated with the clusters (C_ID), SPPRs (S_ID), mass storages
(M_ID), and library routines (L_ID). All these identification numbers
act as processing environment specifiers. The way they are used is
indicated in Table I. The short specification can be used only if the
missing specifiers are known from the context.

Figure 3.  The AIDA: An Architecture for Efficient Support of the
LOCO Approach to Distributed Task Allocation.

Table I.  Processing Environment

Item                Full                     Short specification  Example

CLUSTER          C  C(C_ID)                  C(C_ID)              [illegible]
HOST             H  H(H_ID)                  H(H_ID)              H(7)
JOB              J  J(H_ID,P_ID,J_ID)        [illegible]          [illegible]
LIBRARY ROUTINE  L  L(C_ID,L_ID)             L(L_ID)              [illegible]
MASS STORAGE     M  M(C_ID)                  M(C_ID)              [illegible]
PROCESS          P  P(H_ID,P_ID)             P(P_ID)              P(7,5) or P(5)
SPPR             S  S(C_ID,S_ID)             S(S_ID)              [illegible]
TASK             T  T(H_ID,P_ID,J_ID,T_ID)   T(T_ID)              [illegible]
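
The relationship between the short and the full specification can be
illustrated with a small helper. This is a sketch only: the function
name `full_spec` and its arguments are hypothetical, and it simply
prefixes the identifiers that are "known from the context" to the short
specification.

```python
def full_spec(letter, short_ids, context_ids=()):
    """Build a full specification string from a short one.

    letter      -- specifier letter, e.g. "T" for a task
    short_ids   -- identifiers given in the short specification
    context_ids -- identifiers known from the context (may be empty)
    """
    ids = tuple(context_ids) + tuple(short_ids)
    return f"{letter}({','.join(str(i) for i in ids)})"
```

For instance, with the host, process, and job identifiers 7, 5, and 3
known from the context, `full_spec("T", (3,), (7, 5, 3))` yields the
full task specification `T(7,5,3,3)`.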

The following identification numbers and related pieces of
information are needed to allocate and run the task: cluster ID (C_ID),
library routine ID (L_ID), program parameters (or their locations), and
the data (or their locations). Data for a task reside either in a single
memory block of one of the system resources (host or SPPR), or in a
number of memory blocks, possibly some in the hosts and others in the
SP processors. Each system resource containing a data block keeps a
list of all tasks that will need or that might need that data block
(until permission is given to delete that data block). So, if a task
needs a data block, it must know the ID of the system resource currently
holding that data block. When requesting the data block, the task has
to specify its own ID (T_ID). Each task is associated with a vector,
the elements of which define the sources of input data for that task. A
scalar E is also associated with each task. Its form is either E = X or
E = (C_ID, S_ID). It specifies which SPPR executed that task. Initially
the value of E is undefined, i.e., E = X. When a task is assigned to an
SPPR, this processor will set the value of E to point to itself, i.e.,
E = (C_ID, S_ID). Once a task is executed, the output data will be
stored in the local memory of the SPPR that executed that task.

Each host and SPPR has attached to it a task allocation
controller. This controller is a hardware device which executes special
purpose dedicated software to interface the host or SPPR to the
interconnection network and to the rest of the system. The controllers
are inserted between the host or SPPR and the network. The controller
can be implemented as a VLSI chip and will be, hereafter, referred to
as the LOCO station. The LOCO station controls the access to all the
buses and executes the task allocation procedure. So hosts and SPPRs
are free of these activities, which has a number of positive
implications on system expandability, reliability, and compatibility of
various heterogeneous SPPRs. Different access schemes can be used on
the buses. The concept of carrier sense multiple access with collision
detection (CSMA/CD) seems to be the most suitable. However, the
analysis of possible access schemes will not be presented as a part of
this work. When the station acquires a bus, it broadcasts the message
with the destination address in its heading. The message is accepted
only by the station that matches the address from the message header.
On the data and software buses the station responds only if it
recognizes its own address in the message heading. On the allocation
bus, each station responds to three types of addresses: (a) cluster
address (C_ID), (b) station address, i.e., SPPR address (C_ID, S_ID),
and (c) the address of the train (to be defined later) currently
located at that station (H_ID, P_ID, J_ID). If the address in the
message header consists of a C_ID only, the message will be accepted by
the station associated with the first idle SPPR in the chain of the
cluster C_ID (this can be ensured by appropriate daisy chaining). If
the address in the message header consists of both a C_ID and S_ID, the
message will be accepted by the specified SPPR. If the address in the
message header consists of a H_ID, P_ID, and J_ID, the message will be
accepted by the station currently in possession of the train (to be
defined later).
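
The three address types recognized on the allocation bus can be
sketched as follows. This is an illustration under assumptions, not the
controller's actual logic: the `Station` class, the tuple encoding of
addresses (one, two, or three integers), and the function names are
hypothetical, and the daisy chain is modelled as an ordered Python list.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Station:
    cluster_id: int
    sppr_id: int
    idle: bool = True
    # (H_ID, P_ID, J_ID) of the train currently held at this station, if any
    train_id: Optional[Tuple[int, int, int]] = None


def accepts(station, address):
    """Return True if the station recognizes the header address."""
    if len(address) == 1:    # (C_ID,): any idle SPPR of that cluster
        return station.idle and address[0] == station.cluster_id
    if len(address) == 2:    # (C_ID, S_ID): one specific SPPR
        return address == (station.cluster_id, station.sppr_id)
    if len(address) == 3:    # (H_ID, P_ID, J_ID): station holding that train
        return station.train_id == address
    return False


def deliver(chain, address):
    """Walk the chain in order and return the first accepting station, or None."""
    for station in chain:
        if accepts(station, address):
            return station
    return None
```

Walking the list in order plays the role of the daisy chain: for a
cluster-only address, the first idle SPPR in the chain wins.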

IV.  TASK ALLOCATION PROCEDURE

The basic idea in the LOCO approach to distributed task allocation
is to use the processing environment specifiers (see Table I) in
competition among different tasks for the SPPRs they need. The
specifiers define which job is competing for which SPPR type (indicated
by the C_ID). Once the SPPR is assigned to a job in order to execute
one of its tasks, it is loaded with the necessary library routine,
parameters, and data, and the execution of the task can start. When the
execution of the task is completed, the job will compete for the new
SPPR that it needs, and so on.

The LOCO procedure will first be presented through an example
and then it will be generalized. Assume that the host H(H_ID) = H(7) is
running the process P(H_ID, P_ID) = P(7,5), or abbreviated P(5), with a
job J(H_ID, P_ID, J_ID) = J(7,5,3), or abbreviated J(3). Assume that
J(3) consists of four tasks, connected as in Fig. 1 and abbreviated as
T(1), T(2), T(3), and T(4). Assume that T(1) has to be executed in
cluster C(3) under control of the library routine L(5), T(2) in C(4),
T(3) in C(2), and T(4) in C(1), each under control of its own library
routine [routine IDs illegible]. Input data for task T(1) reside in
host H(7), for T(2) [illegible], for T(3) in the SPPR that executed
T(1), and for T(4) in the SPPRs that executed T(2) and T(3). The final
data needed by job J(3) reside in the SPPRs that executed T(3) and
T(4).

One possible interpretation of the above specified job may be as
follows. T(1) refers to the transformation of a flying object image and
is best executed on the SPPR of the SIMD type. T(2) refers to
processing of the seismic signal (that may contain some information
referring to the object launching) and is best executed on a specially
designed array processor. T(3) refers to image understanding and needs
the appropriate SPPR of the MIMD type. Output data from this task are
for some reason needed by the job source. Finally, T(4) refers to
intelligent retrieval from a knowledge base within an expert system
which is oriented to identification of flying objects. It needs input
data from T(2), T(3), and the job source. Its output data are needed in
the job destination (same or another host), and also in the job source,
e.g., for updating of relevant information. Another possible
interpretation may be in the domain of the medical experiment, where
T(1) and T(3) refer to processing and understanding of the scanner
image, T(2) to processing of an EEG signal, and T(4) to intelligent
retrieval of appropriate knowledge within an expert system.

Figure 5.  Structure of the Wagon and Contents of Different
Wagon Sections. Depending on the LOCO version, the
DATA_SECTION may contain either data or pointers to
data.


Now we describe the way in which the LOCO procedure will
treat the above specified job. Once the job J(3) is defined in the
process P(5), the host's station corresponding to H(7) will create the
message (train) that consists of a number of submessages. One of them
is dedicated to the job J(3) as a whole (the LOCOmotive). The others
are dedicated to different tasks (the wagons).

The structure of the locomotive is shown in Fig. 4. It consists of
four sections. Sections HOST_SECTION, PROCESS_SECTION, and
JOB_SECTION define the processing environment of the corresponding
job. Section DATA_SECTION defines the tasks that produce the
final data needed by this job (other tasks produce the intermediate
data only). Since the locomotive plays the vital role in the task
allocation approach under consideration here, it is referred to as the
LOCO approach. Note that the above description of the locomotive
implies the case when the job source and the job destination are the
same. For the case when the job source may be different from the job
destination, only a minor modification of the locomotive is required.

The structure of the wagon is shown in Fig. 5. Each wagon W(k),
k = 1, 2, ..., consists of six sections. Section CLUSTER_SECTION
specifies the cluster in which the task corresponding to that wagon has
to be executed. Section TASK_SECTION specifies the task
corresponding to the wagon. Since the wagon is always appended to
the locomotive, the short specification of the task can be used. The
full specification can be obtained by combining this section and the
first three sections of its locomotive. The LIBRARY_ROUTINE_SECTION
specifies the library routine to be used in the task execution.
Typically, this section will also contain the parameters to be passed
to the routine, or at least the pointers to these parameters. The
DATA_SECTION specifies the tasks that produce data for the task
corresponding to the wagon. The EXECUTOR_SECTION specifies the
particular SPPR that executed the task corresponding to that wagon.
Before the execution of the job starts, it is not known which SPPR will
execute which task, so, as indicated earlier, the contents of this
section are initially E = X. The WAGON_STATUS_SECTION contains the
specifier W that indicates if the task corresponding to this wagon is
currently under execution somewhere in the system (W = IMAG) or not
(W = REAL). If W = REAL and the wagon is behind the locomotive,
its execution is completed. If W = REAL and the wagon is in front of
the locomotive, its execution has not started yet.
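
The locomotive and wagon layouts described above map naturally onto two
record types. The following is a minimal sketch, not the paper's
message format: the field names, the use of `None` for E = X, the
strings "REAL" and "IMAG", and the list-based train (ordered front to
back) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Locomotive:
    host_id: int                  # HOST_SECTION
    process_id: int               # PROCESS_SECTION
    job_id: int                   # JOB_SECTION
    final_data_tasks: List[int]   # DATA_SECTION: tasks producing the final data


@dataclass
class Wagon:
    cluster_id: int               # CLUSTER_SECTION
    task_id: int                  # TASK_SECTION (short specification)
    library_routine: int          # LIBRARY_ROUTINE_SECTION
    data_tasks: List[int] = field(default_factory=list)   # DATA_SECTION
    executor: Optional[Tuple[int, int]] = None             # EXECUTOR_SECTION; None means E = X
    status: str = "REAL"          # WAGON_STATUS_SECTION: "REAL" or "IMAG"


def full_task_id(loco, wagon):
    """Combine the first three locomotive sections with the short task ID."""
    return (loco.host_id, loco.process_id, loco.job_id, wagon.task_id)


def task_state(train, wagon):
    """Interpret the wagon status from W and the wagon's position in the train.

    train is a list ordered from front to back that contains one Locomotive.
    """
    loco_pos = next(i for i, item in enumerate(train) if isinstance(item, Locomotive))
    pos = train.index(wagon)
    if wagon.status == "IMAG":
        return "executing"
    return "not started" if pos < loco_pos else "completed"
```

In this sketch an empty `data_tasks` list stands for "input data come
from the job source", which is also an assumption of the illustration.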

For the particular case of Fig. 1, the initial form of the train is
given in Fig. 6a. Initially, the locomotive is pushing the train. The
front wagon corresponds to T(1), the next one to T(2), etc. Of course,
the appropriate preamble should be appended to the front, and the
appropriate cyclic redundancy check (CRC) for error detection purposes
to the end of the train. Once the train is created, the host's station
will compete for the allocation bus and, after accessing it, the
station will broadcast the train. The CLUSTER_SECTION of the front
wagon (now W(1)) has the function of the train destination address,
and the train will end up in the first currently idle SPPR of the
cluster C(3). When the train is accepted, an acknowledgement will be
sent to the train transmitter. The ID of the train transmitter can be
found by examining the contents of the locomotive. If the first SPPR in
the chain is currently busy, it will pass the train to the next SPPR in
the chain. If none is available, the train will "wait" in the queue
until the first SPPR becomes available. This "queue" may physically
exist in the form of the closed cluster-dedicated loop within which the
train is propagated until allocated. This is indicated by dashed lines
in Fig. 2c. The station that transmitted the train will wait for the
acknowledgement. When the acknowledgement is received, it will clean up
the buffer in which the train was stored and will use it for other
purposes. Here an error-free channel is assumed.

Assume one of the SPPRs is free (e.g., S(7)). Using the allocation
bus, the SPPR S(7) will acknowledge the receipt of the train and at the
same time it will request data from all the sources specified in the
DATA_SECTION of W(1), i.e., from P(7,5). Using the software bus,
the SPPR S(7) will request the library routine specified in the program
section of W(1), i.e., L(5). In the meantime, before the data and
program arrive, the SPPR S(7)'s station will examine W(2) to see if
T(2) can run concurrently with T(1). This is indicated by the contents
of the DATA_SECTION of W(2). In this example, concurrency is possible.
Note that the wagon must be in the station while the SPPR works on its
load. It will have W = IMAG during that time. So, W(1) will be
removed and the imaginary copy of W(1) will be appended to the back
of the train. A copy of the locomotive will be saved at the station
along with the wagon. The train (see Fig. 6b) will now be broadcast and
hopefully accepted by one of the stations in C(4), e.g., S(17). The
station S(17) will acknowledge the receipt of the train, will request
its data and program, and will examine W(3) for possible concurrency.
This time concurrency will not be possible, since T(3) needs data from
T(1) and the wagon corresponding to T(1) is imaginary, which means that
T(1) is not yet completed. So, the train will sit in the station S(17)
for some time. So far, our example clearly points to the ability of the
LOCO procedure to exploit maximally the existing parallelism on the
task level. Other more sophisticated forms of parallelism could be
handled by the LOCO procedure equally well.
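
The concurrency test performed by a station can be sketched as a check
on the DATA_SECTION of the next wagon. This is a hypothetical
illustration: the dictionary keys (`kind`, `task`, `needs`, `status`)
and the list-based train ordered from front to back are assumptions,
and an empty `needs` list is taken to mean that the task depends on no
other task of the job.

```python
def completed_tasks(train):
    """Tasks whose wagons are REAL and located behind the locomotive."""
    loco_pos = next(i for i, item in enumerate(train) if item["kind"] == "loco")
    return {
        item["task"]
        for i, item in enumerate(train)
        if item["kind"] == "wagon" and item["status"] == "REAL" and i > loco_pos
    }


def can_start(train, wagon):
    """A task may start once every task it needs data from is completed."""
    done = completed_tasks(train)
    return all(task in done for task in wagon["needs"])
```

With the train in the state where T(1) is executing (its wagon is
imaginary and sits behind the locomotive), a wagon that needs data from
T(1) is held back, while a wagon with no such dependence may proceed.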

After some time, T(1) will be completed. The station S(7) will
place the output data into its local memory and will "remember" that
the data will be needed by T(7,5,3,3). That information is obtained
from the train while it is at the station. The station S(7) will set up
E = S(7) in the wagon W(1), will append W(1) to the train, and will
broadcast it (see Fig. 6c).

The message from Fig. 6c will be accepted by the station which is
currently in possession of the train, i.e., S(17). The station S(17)
will now exchange the imaginary wagon with the real one, and will
reexamine if T(3) can run concurrently. Since now it can, the train of
the form indicated in Fig. 6d will be broadcast. Assume that this train
will be accepted by S(2,27) and that T(4) will be executed in S(1,37).
In that case, the train will have the forms indicated in Figs. 6e, 6f,
and 6g. Note that a wagon will be destroyed when it is not needed any
more. Finally, the locomotive is pulling the wagons. Once the train
from Fig. 6g is accepted by P(7,5), it will request the final data from
S(2,27) and S(1,37). At last, P(7,5) will broadcast the permission to
delete all memory blocks corresponding to J(7,5,3).

Our example described the basic idea of the LOCO approach. A
more rigorous definition can be easily derived from this example.
However, note that the LOCO approach is more powerful than indicated by
the example. Instead of the topology from Fig. 1, any topology can be
used. Next, in the example used here, the schedule of the train and its
load (i.e., which type of SPPRs will be visited and what will be the
data sources) is set up at the time when the train is created.
However, each task can be given the possibility to change the contents
of all the wagons corresponding to the tasks not yet executed. In that
case the task execution is made conditional, as well as the data to be
used. Also, as indicated earlier, each SPPR can be given the
possibility to treat each accepted task as a secondary process which
can generate secondary jobs and secondary tasks, where a new secondary
train has to be associated with each secondary job. Also, it is very
important to note that, under assumption #5, all possible parallelism
on the task execution level can be fully exploited by the LOCO
procedure. The actual extent to which the parallelism will be exploited
depends upon how the train is composed when it is generated, i.e., the
way in which the job is decomposed into tasks and the way in which the
wagons are ordered.

Note that the LOCO procedure can exist in various versions. In
one version, the train first competes for the appropriate SPPR and
then collects the input data (specified by the pointers in the train).
In a variation, the train first collects the data needed for the task
and then competes for the appropriate SPPR. The former version was
explained in the example, since we feel it is simpler. It needs a
smaller queueing buffer in each cluster, but is less time-efficient.
The latter version will be treated in the performance analysis to
follow. It needs a larger queueing buffer in each cluster, but is more
time-efficient.

V.  MODELLING 

We first develop a model of a complex multitask job (intertask
model) which is applicable to both the LOCO and LB approaches.
Then we develop the models of the task execution time (intratask
model), separately for the LOCO and LB approaches. Load balancing
has attracted a lot of research interest, and some very good work has
been reported recently [e.g., ChoAb82, NiJiwad), TaoTodt, WahJu83].
However, here under the term LB approach we consider the approach
which is obtained by applying the principles of load balancing to the
system architecture of Fig. 2b.

A.  Model of the Multitask Job (Intertask Model)

We assume a complex multitask job that consists of J tasks
(running serially and/or concurrently). Each task is serviced by a
generalized service station (GSS). Activities of the GSS include
allocation of the task to one of the appropriate SPPRs, collection of
input data from appropriate data sources, collection of the library
routine from the appropriate mass storage unit, and execution of the
task. The only difference between the GSS for the LOCO and LB
approaches is in the allocation of the task to one of the appropriate
SPPRs. So the differences are within the GSS and are not visible on the
level of the intertask model. Consequently, both procedures can be
represented by the same open queueing network model [Kobay78], as
indicated in Fig. 7.

Our analytical model based on queueing theory incorporates only
the most essential parameters of the two procedures under
consideration. We are forced to such an approach by the inherent
limitations of queueing theory.

We assume an infinite population queueing network. Task
generation does not depend on the number of tasks currently existing in
the network. Task generation is governed by the Poisson process.
Routing of tasks abides by a first-order Markovian chain. Queueing
discipline at each GSS can be any work-conserving one. Service time is
exponentially distributed. The task destination is capable of absorbing
all tasks departing from the system. The observation interval is long
enough so that the system can reach a steady state. Under these
conditions, Jackson's decomposition theorem [Kobay78] holds, and the
steady-state distribution of the probability that the network is in
state n is given by:

\[ p(\mathbf{n}) = \prod_{i=1}^{N} p_i(n_i) \]
Figure 7.  Open Queueing Network Model of a Multitask Job for
the LOCO and LB Approaches.

[symbol illegible] - Branching probabilities for different tasks
within the job

GSS - Generalized service station

J - Number of tasks in the job

i, j = s (Source), 1, ..., J, d (Destination)

λ - Poisson arrival rate at the source node

where p_i(n_i) is the marginal distribution of the variable n_i
(i = 1, ..., N), and N refers to the number of possible states
[Kobay78]. Elements of the vector n refer to the number of tasks in
each of the N_q SPPRs of a given cluster. This conclusion implies that
execution of different tasks within a complex multitask job can be
analyzed independently one from another, regardless of the intertask
data dependency and other relevant parameters. The same holds for both
the LOCO and LB approaches. On the basis of this conclusion, in the
next subsection the intratask models for the LOCO and LB approaches are
introduced and used later for their comparative performance analysis.
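
The product form stated by the decomposition theorem can be illustrated
numerically. The sketch below is an assumption-laden example rather
than the paper's model: it takes each node to behave as an M/M/1 queue
with utilization rho, whose marginal is (1 - rho) rho^n, and the
function names are hypothetical.

```python
from math import prod


def mm1_marginal(rho, n):
    """Steady-state probability of n tasks at an M/M/1 node with utilization rho."""
    if not 0 <= rho < 1:
        raise ValueError("utilization must satisfy 0 <= rho < 1")
    return (1 - rho) * rho ** n


def joint_probability(rhos, state):
    """Product-form probability of a network state, one entry per node."""
    return prod(mm1_marginal(rho, n) for rho, n in zip(rhos, state))
```

The point of the example is only that the joint probability factors
into per-node terms, which is what allows each task to be analyzed
independently.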

B.  Model  of  the  Task  Execution  (Intratask  Model) 

We consider a task which belongs to a complex (multitask) job,
and its execution in the GSS. In general, input data for the task
reside in one or more of the hosts or SPPRs. Now we assume that input
data reside in a given host. Also, we assume that when the task is
ready for execution, before its allocation, first the input data have
to be requested from the host. The same applies for both the LOCO and
LB approaches. The model of the host as a source station for data
retrieval is given in Fig. 8a. The FIFO queueing discipline is assumed.
If the data request at the host i (i = 1, ..., N_h) is a Poisson
process characterized by the arrival rate $\lambda_{h,i}$, the
probability density function (p.d.f.) of the data request arrival
intervals is given by:

\[ f^{a}_{h,i}(t) = \lambda_{h,i} e^{-\lambda_{h,i} t}; \quad t \ge 0 \]

The value of $\lambda_{h,i}$ can easily be measured as:

\[ \lambda_{h,i} = \frac{\text{number of data requests}}{\text{unit of time}} \]

If the service at the host i (i = 1, ..., N_h) is an exponential one
characterized by the service rate $\mu_{h,i}$, then the p.d.f. of the
service time is given by:

\[ f^{s}_{h,i}(t) = \mu_{h,i} e^{-\mu_{h,i} t}; \quad t \ge 0 \]

Figure 8.  [...] of the Task Execution Model.

(a) The host as a service station for data retrieval
(b) The LOCO approach
(c) The LB approach

[two legend entries defining waiting times are illegible]

The value of $\mu_{h,i}$ can easily be measured as:

\[ \mu_{h,i} = \frac{C_{h,i}}{D_{h,i}} \]

where $D_{h,i}$ refers to the average number of instructions executed
during data retrieval, and $C_{h,i}$ to the average number of
instructions executed in a unit of time. The utilization factor is
given by:

\[ \rho_{h,i} = \frac{\lambda_{h,i}}{\mu_{h,i}} \]

The waiting time p.d.f. at the host is given by:

\[ f^{w}_{h,i}(t) = (1-\rho_{h,i})\,\delta(t)
   + \lambda_{h,i}(1-\rho_{h,i})\, e^{-\mu_{h,i}(1-\rho_{h,i}) t}; \quad t \ge 0 \]

Finally, the response time p.d.f. at the host is given by:

\[ f^{r}_{h,i}(t) = f^{s}_{h,i}(t) \circledast f^{w}_{h,i}(t); \quad t \ge 0 \]

where $\circledast$ stands for convolution. The average response time
for data retrieval at the host is given by:

\[ \bar{r}_{h,i} = \frac{1}{\mu_{h,i} - \lambda_{h,i}} \]
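
As a worked illustration of the host model, the quantities above can be
computed directly from measured values. This is a minimal sketch
assuming the single-server M/M/1 setting described in the text; the
function name `host_response` and its argument names are hypothetical.

```python
def host_response(arrival_rate, instr_per_time, instr_per_retrieval):
    """Return (service rate, utilization, mean response time) for one host.

    arrival_rate        -- data requests per unit of time (lambda)
    instr_per_time      -- instructions executed per unit of time (C)
    instr_per_retrieval -- instructions executed per data retrieval (D)
    """
    service_rate = instr_per_time / instr_per_retrieval   # mu = C / D
    utilization = arrival_rate / service_rate             # rho = lambda / mu
    if utilization >= 1:
        raise ValueError("host is saturated: no steady state exists")
    mean_response = 1.0 / (service_rate - arrival_rate)   # 1 / (mu - lambda)
    return service_rate, utilization, mean_response
```

For example, a host that executes 1,000,000 instructions per second,
needs 10,000 instructions per retrieval, and receives 50 requests per
second has a service rate of 100 retrievals per second, a utilization
of 0.5, and a mean response time of 0.02 seconds.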

Next, after the data are requested and obtained, the task is allocated
to an SPPR according to the existing task allocation procedure. Note
that our model of the LOCO procedure concentrates on a single cluster.
So the parallel execution of different tasks in different clusters is
incorporated only indirectly.

In the case of the LOCO approach, the task is sent to the queue
corresponding to the appropriate logical cluster. As indicated earlier,
this queue may physically exist in the form of the closed
cluster-dedicated loop within which the train is propagated until
allocated (see Fig. 2c). So the logical cluster can be modelled as a
single multiple-server service station. The FIFO queueing discipline is
assumed. We assume a Poisson arrival of the tasks (due to the
decomposition of the complex multitask jobs with Poisson arrivals) with
the arrival rate at cluster j (j = 1, ..., N_c) equal to
$\lambda_{s,j}$. We assume an exponential service at each SPPR in the
cluster, with the service rate equal to $\mu_{s,j}$. Thus, an M/M/m
queueing system is assumed, where m = N_q. Both $\lambda_{s,j}$ and
$\mu_{s,j}$ can easily be measured in real systems. The traffic
intensity of the cluster is given by $a_j = \lambda_{s,j}/\mu_{s,j}$
and the utilization factor by $\rho_{s,j} = a_j/m_j$, where $m_j$ is
the number of SPPRs in the cluster j. Note that $m_j$ = N_q under the
assumption that each cluster contains the same number of SPPRs. Now we
are temporarily removing that assumption in order to make the results
more general. The response time p.d.f. at the cluster j is given by
convolution of the appropriate service time p.d.f. and waiting time
p.d.f.:

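$$f_R^{C,j}(t) = f_S^{C,j}(t) \circledast f_W^{C,j}(t)$$

In expanded form this reads:

$$
f_R^{C,j}(t) =
\begin{cases}
\mu_{C,j} e^{-\mu_{C,j} t}
 + \dfrac{\mu_{C,j}\, E_{2,m_j}(a_j)}{1 - m_j(1-\rho_{C,j})}
   \left[ m_j(1-\rho_{C,j})\, e^{-m_j \mu_{C,j}(1-\rho_{C,j}) t}
          - e^{-\mu_{C,j} t} \right],
 & t > 0;\ \rho_{C,j} \neq \dfrac{m_j - 1}{m_j} \\[2ex]
\mu_{C,j} e^{-\mu_{C,j} t}
   \left[ 1 + E_{2,m_j}(a_j)\,(\mu_{C,j} t - 1) \right],
 & t > 0;\ \rho_{C,j} = \dfrac{m_j - 1}{m_j}
\end{cases}
$$

where

$$
E_{2,m_j}(a_j) =
\frac{\dfrac{a_j^{m_j}}{m_j!\,(1-\rho_{C,j})}}
     {\displaystyle\sum_{k=0}^{m_j-1} \frac{a_j^{k}}{k!}
      + \frac{a_j^{m_j}}{m_j!\,(1-\rho_{C,j})}}
$$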

Finally, the average response time for task execution in the logical
cluster is given by

$$T_j(\mathrm{LOCO}) = \int_0^{\infty} t\, f_R^{C,j}(t)\, dt$$

Note that $T_j(\mathrm{LOCO})$ refers to the average time that a task
spends in the LOCO cluster after the data are retrieved.
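For an M/M/m station this mean has a standard closed form: the mean service time plus the mean wait, where the mean wait is the probability of waiting (Erlang's second formula) divided by the spare service capacity. The following is a minimal sketch under that textbook result; the function names `erlang_c` and `mean_response_time` are illustrative and do not come from the paper.

```python
import math


def erlang_c(m: int, a: float) -> float:
    """Erlang's second formula: probability that an arriving task must wait.

    m is the number of servers (SPPRs in the cluster) and a = lam / mu is
    the traffic intensity.
    """
    rho = a / m
    if not 0.0 <= rho < 1.0:
        raise ValueError("the queue is stable only for a / m < 1")
    tail = a**m / (math.factorial(m) * (1.0 - rho))
    head = sum(a**k / math.factorial(k) for k in range(m))
    return tail / (head + tail)


def mean_response_time(lam: float, mu: float, m: int) -> float:
    """Mean time a task spends in an M/M/m station (service plus waiting)."""
    a = lam / mu
    return 1.0 / mu + erlang_c(m, a) / (m * mu - lam)


if __name__ == "__main__":
    # Same total traffic spread over more servers.
    for servers in (1, 2, 4, 8):
        print(servers, mean_response_time(lam=0.5 * servers, mu=1.0, m=servers))
```

With the utilization held fixed, the printed mean response time shrinks as servers are added, because pooling servers behind one queue reduces the waiting component.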

As already mentioned, we consider a task which is part of a complex
multitask job. The model has to incorporate both the response time for
data retrieval and the response time for task execution. Passing the
output data, from the task under consideration, to the following task
is incorporated into the model of the following task.* In conclusion,
the GSS for the case of the LOCO approach can be modeled as a cascade
of the models from Figs. 8a and 8b. This issue will be addressed in
Section VI.

tSki  Iku  i«kk  t«  *»  iorv»f<ifU  i»  tkt  |«b  dvftiskiw*  Hawvvtr  iSit  cm
f  ikp  xvnbvr  ->(  ^«r4tlv  •trctiru  vMti  it  irge  csosgk

In the case of the LB approach, the source of the task first inquires
about the load of different SPPRs appropriate for that task. After that
information is obtained, the data for the task are requested and
obtained and the task is sent to the queue of the SPPR which reported
the minimal load (in terms of the total estimated execution time of all
tasks currently waiting in its queue). Note that the reported load
represents the estimated value ($\hat{W}$), and not the real value
($W$). The minimal reported value is not necessarily the absolute
minimal value. This is indicated in Fig. 3c. In conclusion, the GSS for
the case of the LB approach can be modelled as a cascade of the models
from Figs. 8a and 8c. Under the same conditions as in the case of the
LOCO approach, if the load estimation is ideal, the average task
response time at the SPPR should be given by the same equation for both
approaches:

$$T_j(\mathrm{LB, IDEAL}) = T_j(\mathrm{LB}; \sigma = 0) = T_j(\mathrm{LOCO})$$

where $\sigma$ refers to the standard deviation of the task execution
time estimate. In this case, the LOCO and LB approaches are
characterized by the same performance.

For fair comparison of the LOCO and LB approaches, a model for the LB
approach has been chosen which maximally favors the LB approach.
Consequently, a multiserver model has been chosen, with information on
the standard deviation of the task execution time estimate incorporated
into the service time p.d.f.

According to the Kingman-Köllerström approximation [Klein76], the
waiting time distribution in a G/G/m system is given by:

$$W(t) \cong 1 - \exp\left( -\frac{2(1-\rho)\, t}{\lambda \left( \sigma_a^2 + \sigma_s^2 / m^2 \right)} \right)$$

where $\sigma_a^2$ refers to the variance of the interarrival time, and
$\sigma_s^2$ to the variance of the service time. The waiting time
p.d.f. $f_W(t)$ is a derivative of the above given $W(t)$. Using this
approach we evaluate $T_j(\mathrm{LB})$ for the M/G/$N_U$ system. Note
that the LB approach is characterized by: $m_j = N_U = m$,
$\mu_j = \mu$, and $\lambda_j = \lambda$. For G we select the gamma
distribution defined by:

$$
f(t) =
\begin{cases}
\dfrac{\alpha (\alpha t)^{d-1} e^{-\alpha t}}{\Gamma(d)}, & t \ge 0 \\[1.5ex]
0, & t < 0
\end{cases}
$$

with $d = 3$ and $\alpha = A\mu$. We have chosen an integer $d$ to
simplify the analysis, without affecting its generality. The value
$d = 3$ has been chosen as it is the case when the gamma distribution
closely corresponds to the normal distribution [Klein75]. Parameter $A$
has been incorporated to enable more flexible variations of the mean
and the variance. For the gamma distribution, the mean and variance are
equal to $d/\alpha$ and $d/\alpha^2$, respectively [Kobay78]. For
selected values of $\alpha$ and $d$, the service time p.d.f. is given
by:


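$$
f_S(t) =
\begin{cases}
\dfrac{(A\mu)^3\, t^2\, e^{-A\mu t}}{2}, & t \ge 0 \\[1.5ex]
0, & t < 0
\end{cases}
$$

with the mean equal to $\frac{3}{A\mu}$, and the variance equal to
$\frac{3}{A^2\mu^2}$. Note that $\int_0^{3/(A\mu)} f_S(t)\,dt = 0.575$,
and for the exponential distribution we have
$\int_0^{1/\mu} \mu e^{-\mu t}\,dt = 0.63$. This approach allows us to
evaluate the LB system performance for different values of $\sigma$
($\sigma$ was defined earlier), and for various appropriate values of
$A$.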
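The stated mean and variance of the service time can be checked numerically. The sketch below integrates the gamma density with $d = 3$ and $\alpha = A\mu$ by the trapezoidal rule; the helper names (`gamma3_pdf`, `raw_moment`) and the parameter values in the usage lines are assumptions made only for illustration.

```python
import math


def gamma3_pdf(t: float, A: float, mu: float) -> float:
    """Gamma density with shape d = 3 and rate alpha = A * mu."""
    if t < 0.0:
        return 0.0
    alpha = A * mu
    return alpha**3 * t**2 * math.exp(-alpha * t) / 2.0


def raw_moment(k: int, A: float, mu: float, steps: int = 20_000) -> float:
    """k-th raw moment, by the trapezoidal rule on [0, 60 / alpha].

    The tail beyond 60 / alpha is negligible for this density.
    """
    upper = 60.0 / (A * mu)
    h = upper / steps
    total = 0.0
    for n in range(steps + 1):
        t = n * h
        weight = 0.5 if n in (0, steps) else 1.0
        total += weight * t**k * gamma3_pdf(t, A, mu)
    return total * h


if __name__ == "__main__":
    A, mu = 1.75, 2.0  # illustrative values only
    mean = raw_moment(1, A, mu)
    variance = raw_moment(2, A, mu) - mean**2
    print(mean, 3.0 / (A * mu))
    print(variance, 3.0 / (A * mu) ** 2)
```

Because the variance scales as $1/A^2$, a smaller $A$ spreads the service time more widely, which is how the parameter lets the model vary the estimate's dispersion.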

The response time p.d.f. for the LB system is given by convolution of
the appropriate service time p.d.f. and waiting time p.d.f.:

$$f_R^{j}(t) = f_S^{j}(t) \circledast f_W^{j}(t)$$

where $j = 1, \ldots, N_U$ (number of SPPRs in a cluster). In expanded
form this reads:


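$$
f_R^{j}(t) = \frac{A^3\mu^3 C\, e^{-Ct}}{(A\mu - C)^3}
 + \frac{A^3\mu^3 C \left[ t^2 (C - A\mu)^2 - 2t\,(C - A\mu) + 2 \right] e^{-A\mu t}}{2\,(C - A\mu)^3},
 \qquad C < A\mu \ \text{or}\ C > A\mu
$$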


_  _  2.A*mH)l-s) 

aV+3 

Finally, the average response time for task execution (after the data
are retrieved) is given by:

$$T_j(\mathrm{LB}) = \int_0^{\infty} t\, f_R^{j}(t)\, dt$$

Another possibility for dealing with the LB approach is by using the
following assumption: If the load estimation is nonideal
($\sigma \neq 0$), then the average task execution time for the LB
approach should be given by:

$$T_j(\mathrm{LB}; \sigma \neq 0) = T_j(\mathrm{LOCO}) \cdot R(\sigma; m_j)$$

where $R(\sigma; m_j)$ is the modification function for the LB
approach, in the case when the data retrieval time is not taken into
consideration. The function $R$ characterizes the load estimation. The
form of the function depends on the type of estimation. We assume that
statistical characteristics of the estimation error
$\eta_i = \hat{W}_i - W_i$ ($i = 1, \ldots, m_j$) at each station are
the same and given by the zero-mean Gaussian distribution of the form:

$$w(\eta_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\eta_i^2 / (2\sigma^2)}, \qquad -\infty < \eta_i < +\infty$$
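The effect of this error model on the LB choice can be illustrated with a small Monte Carlo experiment. This is only a sketch under simplifying assumptions, not the authors' SLAM simulator: the true loads are fixed, each station reports its load plus zero-mean Gaussian noise, and the task goes to the station with the minimal reported value. All names and numbers are hypothetical.

```python
import random


def pick_station(true_loads, sigma, rng):
    """Index of the station whose reported (noisy) load is minimal."""
    reported = [w + rng.gauss(0.0, sigma) for w in true_loads]
    return min(range(len(reported)), key=reported.__getitem__)


def misallocation_rate(true_loads, sigma, trials=10_000, seed=0):
    """Fraction of trials in which the chosen station is not the truly
    least-loaded one."""
    rng = random.Random(seed)
    best = min(range(len(true_loads)), key=true_loads.__getitem__)
    misses = sum(
        pick_station(true_loads, sigma, rng) != best for _ in range(trials)
    )
    return misses / trials


if __name__ == "__main__":
    loads = [3.0, 5.0, 4.0, 6.0]  # true waiting work at four SPPRs
    for sigma in (0.0, 0.5, 1.0, 2.0):
        print(sigma, misallocation_rate(loads, sigma))
```

With `sigma = 0.0` the reported minimum is always the true minimum; as `sigma` grows, the task is sent to a station that is not the least loaded more often, which is the mechanism behind the modification function.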

As already indicated, $\sigma$ is equal to the standard deviation of
the load estimation. It is very difficult to obtain an analytic form of
the function $R(\sigma; m_j)$. The family of curves in Fig. 9 is
obtained by simulation. In this figure, the value of $\sigma$ is
treated relatively to the average execution time of all tasks involved
in the simulation ($\bar{T}$). The level of detail in our simulation
model was chosen to correspond to the level of detail in our analytical
model.

Using the method of empirical-functions smoothing and applying it to
Fig. 9, it is possible to derive an analytical expression for
$R(\sigma/\bar{T}; m_j)$. We assume that the function $R$ could be
given by the following analytical formula:

n(x)  3  K  •  +  1 

Coefficient $K$ depends on $m_j$. It has been determined that it is
equal to 0.77, 1.53, 3.34, and 3.83, for $m_j$ equal to 3, 4, 8 and 18,
respectively. The standard deviation is less than 3.1% in all cases.
Using these results, we get an estimation for coefficient $K$ which is
characterized by a standard deviation less than ... for all selected
cases. This value reads:


Figure 9. Modification Function for the LB Approach, obtained from the
Simulator.

$\sigma$ - Standard deviation of the task execution time estimate
$\bar{T}$ - Average task execution time (execution only, no waiting)
$N_U$ - Number of units in the cluster


K 


It is possible to use this approximation only in the cases when we have
values for the LOCO approach and we need to generate the values for the
LB approach, under the conditions of our simulation. Our simulator is
of the "self-driven" type [Kobay78], and is implemented in the SLAM
language. We have fully followed the methodology of [Kobay78] for
simulation model formulation, simulator implementation, design of the
simulation experiments, validation of the simulation model, and
analysis of the simulation data. Throughout the simulation, the traffic
of individual SPPRs was kept constant.


VI.  PERFORMANCE  ANALYSIS

We consider first the model of the LOCO approach developed in the
previous section (Figs. 8a and 8b). According to [Kobay78], the average
time that a task spends in the system is given by


$$T_{i,j}(\mathrm{LOCO}) = \int_0^{\infty} t\, f_T^{i,j}(t)\, dt$$

where

$$f_T^{i,j}(t) = f_R^{H,i}(t) \circledast f_R^{C,j}(t)$$

We assume that the input data reside in the host $i$
($i = 1, \ldots, N_H$), and that the task is executed in cluster $j$
($j = 1, \ldots, N_C$). After applying a series of transformations we
get:

T.JLOCO)  = 


Mm  .{ 1  '/k  I'J '»  I  i-e, 
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Mk  V I  ~rM. 
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-M.  I  "At  j";‘M  ii  !*’#>»,, /}  I 

where* 


K.i 

mpi.j 


and $E_{2,m_j}(a_j)$ was defined earlier in the text. The first two
formulas apply to the case $m_j = 1$. The dependence of
$\log_{10}[T_{i,j}(\mathrm{LOCO})]$ on $m_j$ is presented in Fig. 10
for the case when $m_j = N_U$, and for different values of the traffic
parameters. Note that total traffic in Fig. 10 is kept constant,
regardless of the value of $N_U$. Consequently, when $N_U$ increases,
the individual traffic of each SPPR decreases. Variance of the total
time that a task spends in the LOCO system is given by


After a series of transformations we get:


Average total time spent in the system, averaged over all possible data
sources and task types, is given by:

$$T(\mathrm{LOCO/SYSTEM}) = \sum_{i=1}^{N_H} \sum_{j=1}^{N_C} T_{i,j}(\mathrm{LOCO})\, p_{i,j}$$

where $p_{i,j}$ refers to the probability that input data reside in the
host $i$ and the task is executed in cluster $j$. Average queue length
of the cluster $j$ ($j = 1, \ldots, N_C$) in terms of the number of
tasks waiting in the queue associated to cluster $j$ is given by
[Kobay78]:


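$$Q_j = \sum_{n=m_j}^{\infty} (n - m_j)\, p_n
      = \frac{a_j^{m_j}}{m_j!} \cdot \frac{\rho_{C,j}}{(1-\rho_{C,j})^2}\; p_0$$

where

$$p_0 = \left[ \sum_{n=0}^{m_j-1} \frac{a_j^{n}}{n!}
        + \frac{a_j^{m_j}}{m_j!\,(1-\rho_{C,j})} \right]^{-1}$$

The variance of the queue length is given by:

$$\sigma_{Q_j}^2 = \sum_{n=m_j}^{\infty} (n - m_j)^2\, p_n - Q_j^2
   = \frac{a_j^{m_j}}{m_j!} \cdot \frac{\rho_{C,j}(1+\rho_{C,j})}{(1-\rho_{C,j})^3}\; p_0 - Q_j^2$$

where $p_0$ is defined above, and $p_n$ is the probability of having
$n$ tasks in the cluster.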
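The queue-length statistics of an M/M/m cluster are straightforward to evaluate from these sums. The sketch below uses the standard closed forms for the two geometric series; the function name `mmm_queue_stats` and the sample parameters are assumptions introduced for illustration.

```python
import math


def mmm_queue_stats(lam: float, mu: float, m: int):
    """Mean and variance of the number of tasks waiting in an M/M/m queue."""
    a = lam / mu          # traffic intensity
    rho = a / m           # utilization factor
    if rho >= 1.0:
        raise ValueError("the queue is stable only for lam < m * mu")
    # Probability of an empty system.
    p0 = 1.0 / (
        sum(a**n / math.factorial(n) for n in range(m))
        + a**m / (math.factorial(m) * (1.0 - rho))
    )
    # Probability of exactly m tasks in the system (all servers busy,
    # nobody waiting); p_n = p_m * rho**(n - m) for n >= m.
    p_m = p0 * a**m / math.factorial(m)
    mean_q = p_m * rho / (1.0 - rho) ** 2
    second_moment = p_m * rho * (1.0 + rho) / (1.0 - rho) ** 3
    return mean_q, second_moment - mean_q**2


if __name__ == "__main__":
    for servers in (1, 2, 4, 8):
        print(servers, mmm_queue_stats(lam=0.8 * servers, mu=1.0, m=servers))
```

For `m = 1` the mean reduces to the familiar M/M/1 value $\rho^2/(1-\rho)$, which gives a quick sanity check on the implementation.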

We consider now the model of the LB approach developed in the previous
section (Figs. 8a and 8c). According to queueing theory, the average
time that a task spends in the system is given by:

$$T_{i,j}(\mathrm{LB}) = \int_0^{\infty} t\, f_T^{i,j}(t)\, dt$$

where:

$$f_T^{i,j}(t) = f_R^{H,i}(t) \circledast f_R^{j}(t)$$

We assume that the input data reside in the host $i$
($i = 1, \ldots, N_H$) and the task is executed in the SPPR of the type
$j$ ($j = 1, \ldots, N_U$). After applying a series of transformations
we get:

e<A« 
a>c 

e<j^ 
s  <c 

O  >  A* 

»<A, 

C>  A« 

•  >A, 

where: 

.«d  D  =  . 

A-p*+3 

The dependence of $\log_{10}[T_{i,j}(\mathrm{LB})]$ on $m_j$ is
presented in Fig. 10 for various values of the traffic parameters. The
plotting is provided for $A = 1.75$. Note that total traffic in Fig. 10
is kept constant, regardless of the value of $N_U$. This is the same as
in the case of Fig. 10, but different compared to Fig. 9. When $N_U$
increases, the individual traffic of each SPPR decreases, but slower
(Fig. 10).

Variance of the total time that a task spends in the LB system is given
by:

$$\sigma_T^2(\mathrm{LB}) = \int_0^{\infty} t^2 f_T^{i,j}(t)\, dt - T_{i,j}^2(\mathrm{LB})$$

After a series of transformations we get:


c  <  A# 

C  <  A# 

■  « 


.1  O  >  A# 

^  •  W^*|C*AM^A«M«  K»'A«|A*(e>A*«»*Aa| 

*  wUi'ffkU  >  S 

where  D  =  *^'1*  ^  **d  A  were  deloed  earlier. 

Average total time spent in the system, averaged over all possible data
sources and SPPR types, is given by:

$$T(\mathrm{LB/SYSTEM}) = \sum_{i=1}^{N_H} \sum_{j=1}^{N_U} T_{i,j}(\mathrm{LB}) \cdot p_{i,j}$$

where $p_{i,j}$ refers to the probability that input data reside in the
host $i$ and the task needs the SPPR of the type $j$. The formulas for
$T(\mathrm{LB})$ and $T(\mathrm{LOCO})$ match each other very closely
for $N_U = 1$. Numerical values differ only in the third decimal digit.

For the LB approach, average queue length $Q_j(\mathrm{LB})$ could be
evaluated using Little's formula [Klein76]:

Q,<LB)  =  X(f/LB)  -  ■^)  =  xfj(LB)  -  ms 


where $T_j(\mathrm{LB})$ was defined earlier, and index $i$ is omitted.
After a series of transformations, we get:


where $C$ and $A$ were defined earlier.

Some conclusions may be derived from Figs. 9 and 10. The higher the
value of $\sigma$ (implies $A < 3$), the larger the performance
difference between the LOCO and LB approaches, which is expected. In
the environment under consideration, as already mentioned in Section
II, the values of $\sigma$ are relatively large due to the fact that,
in the AI environment, the correlation between past values and future
values of execution times for the same type of task may be very low.
This indicates that, for realistic values of $\sigma$, the performance
difference between the LOCO and LB approaches can be relatively high.
For example, according to our simulation, for $\sigma/\bar{T} = 1$
(standard deviation of the estimation is equal to the average task
execution time), and $N_U = 8$ (case of eight SPPRs in each cluster),
the total time that the task spends in the system is 2.6 times shorter
for the LOCO approach, compared with the LB approach. Note that our
simulator neglects the time needed in the LB approach for the inquiry
and processing of the information about the load of different SPPRs.

A number of observations have been derived from our analysis. For
example, with the given conditions, the higher the value of $N_U$, the
larger the performance difference between the LOCO and LB approaches.
However, the step of the performance increase is smaller for the higher
values of $N_U$.

VII.  CONCLUSION

In this paper a problem was recognized, one of having a large number of
special purpose processing resources (SPPRs) shared by a number of
hosts. Processing structures of this type will arise in the 1990s
around AI and other computationally massive applications. Similar
processing structures may arise in the high-end computers of the 5th
generation. In such a processing structure, it is of crucial importance
to have an efficient procedure for the distributed allocation of
different tasks among different SPPRs.

Under the assumptions that affect the above described processing
structure, a distributed task allocation procedure was introduced which
is efficient in a large range of circumstances. Both the task
allocation procedure (LOCO) and the underlying system architecture
(AIDA) were presented and analysed.

One of the most desirable features of this approach is that the task
allocation controller (the LOCO station) can easily be implemented in a
single VLSI chip. The LOCO station acts as an interface between the
SPPR and the tasks to be executed by it. The LOCO station enables SPPRs
of different types to be incorporated into a monolithic task allocation
scheme.

Abstract

This work identifies salient features of MIMD
algorithms. A set of language and machine independent
MIMD constructs is proposed. In analysis, algorithms are
reduced to an equivalent description composed of these
constructs. These constructs are at a low level, thus one
can analyze the algorithm performance in relation to
several MIMD architectures. At the same time, these
constructs are at a high enough level to retain the basic
structure of the algorithm. The paper focuses on issues
of communication and synchronization. Examples from
Ada, CSP, Edison, and Path Pascal are given.


Introduction 

This  work  addresses  problems  in  the  analysis  of 
MIMD  algorithms.  Especially  in  the  area  of  application- 
driven  or  algorithm-driven  architecture  design,  one  would 
like  to  be  able  to  predict  the  performance  of  MIMD 
algorithms  on  different  MIMD  architectures.  Since  few 
MIMD  machines  exist,  direct  execution  is  generally  not 
possible.  Simulation  of  MIMD  processes  at  a  low  level  is 
possible but difficult. The work here is intended to
provide an extension to traditional algorithm analysis.
Features such as inter-process communication and
synchronization, which are critical to the performance of
MIMD algorithms, are mapped from high level languages
to common, relatively low level representations on which
analysis can be performed.

The  underlying  approach  will  be  to  extract  a  few 
primitive  features  of  parallel  algorithms.  These  features 
should  be  comprehensive  enough  to  cover  a  wide  range  of 
language  constructs,  while  being  simple  enough  to 
correspond  to  hardware  capabilities.  The  major  areas  of 
interest  are  communications  and  synchronization.  Models 
are  developed  to  describe  many  forms  of  these  operations 
in  a  uniform  notation.  In  the  following  sections,  some 
high  level  language  constructs  to  express  communications 
and  synchronization  are  surveyed.  We  then  identify  basic 
representations  to  which  the  high  level  constructs  can  be 
mapped.  These  low  level  representations  are  close  enough 
to  the  hardware  level  to  allow  analysis  of  the  effects  of 
hardware  on  the  execution  characteristics.  Just  as 
importantly,  the  representations  encapsulate  the  meaning 
of the original algorithm. By developing these simple
constructs, analysis is simplified and unified.

MIMD  Architecture  Models 

In  MIMD  machine  designs,  two  memory 
organizations  are  common:  the  Shared  Memory  Model 
and  the  Private  Memory  Model. 

This material is based on work supported by the U.S. Army Research
Office under Contract DAAG29-82-K-0101.


The  Shared  Memory  (or  Global  Memory)  Model 
consists  of  a  set  of  N  Processing  Elements  (PEs)  with  no 
local  memory.  These  are  connected  through  a  network  to 
a  global  store.  Examples  of  this  model  include  the  NYU 
Ultracomputer*  and  C.mmp^.  The  major  advantage  of 
the  Shared  Memory  Model  is  that  all  processors  can 
access all of memory. This is important in the analysis of
algorithms  for  this  type  of  system.  One  of  the  critical 
design  problems  in  such  a  system  will  be  the  arbitration 
network.  Much  research  has  been  devoted  to  data 
storage  schemes  to  improve  efficiency  of  the  data 
accesses*’*. 

The Private Memory Model gives each processing
element its own memory. The PEs themselves are
connected directly through a network. An example of
this is the PASM system*. The advantages of the Private
Memory Model include fast exclusive memory access for
each PE to its own memory. The associated cost is the
inability to access all of memory directly. Again, this will
appear in later discussions of algorithm analysis. Siegel
et al.* give a good discussion of the relative benefits of
each model.

These  two  models  identify  one  of  the  largest  single 
differences  between  various  MIMD  architectures.  Many 
designs  contain  aspects  of  both  models.  These  models  are 
not,  therefore,  meant  to  divide  MIMD  architectures  into 
two  classes.  These  models  merely  identify  two  of  the 
most common approaches. For instance, the Texas
Reconfigurable Array Computer (TRAC) combines the
two given models by providing both private and shared
memory®’*'*. For the purposes of the analyses presented
here,  it  is  sufficient  to  show  that  results  are  valid  for  both 
models,  so  that  mixtures  of  models  will  also  produce  valid 
results. 

These  models  are  fairly  simple.  Thus,  they  are  not 
intended  to  take  all  aspects  of  a  parallel  architecture  into 
account.  Yet  in  their  simplicity,  they  distinguish  features 
of an architecture that have a major bearing on its ability
to run parallel algorithms. These models will be useful in
mapping language constructs to actions in the hardware.

Implementing Language Constructs
in Parallel Architectures

Assumptions

In  this  section,  assumptions  made  in  the  subsequent 
analyses  are  outlined. 

There is a distinction to note between tasks or
processes and their relationship to PEs. There are several
approaches. Each task can be statically assigned to a
PE. For Private Memory, this approach is always
reasonable since it takes a significant amount of time to
copy a task in and out of a PE's local memory. For
Shared Memory, this approach is ideal when the number
of tasks is less than the number of PEs. For Shared
Memory, another option is to assign a task dynamically
to any available PE. Over a much longer time frame,
this is feasible for Private Memory as well. Throughout
the following discussion, it is assumed that a process is
provided with memory resources and a PE. In Shared
Memory, the assigned PE may change over time, but the
memory resources are unchangeable by external events.
For Private Memory, the memory is associated with each
PE, so this is fixed. So, to simplify analysis, it will be
assumed that once a memory resource is allocated, it
stays with the process until the process releases it. No
external action can take away or move this resource.

Another assumption is that all communication to
other processes or tasks will be performed through some
general mechanism. This mechanism will have to handle
the details of transferring data from one process to
another using the available facilities. In a Shared
Memory system, data is transferred through a global
store so all data transfers are accomplished in the same
manner. In a Private Memory system, interprocess
communication may imply data is transferred between
two physical PEs. This would not necessarily be the case
when there is more than one process per PE. For the
sake of simplicity, the following discussion will assume
that a similar execution penalty is incurred for both
cases. Except for a trivial MIMD system (2 PEs), or
when traffic patterns are explicitly described, there is a
higher probability that a transfer will require accessing a
physically distinct PE.

Summarizing, to provide for a more concise analysis,
a task or process is assumed to be mapped to a logical PE
and logical memory. In most cases, this will correspond
to a physical PE and physical memory. Future
extensions of this analysis may relax this assumption.

In the following sections, features of parallel
algorithms are investigated. Global variables and
communications are methods for processes to
communicate among themselves. Concurrency
specifications identify code which can be executed
concurrently. Synchronization allows processes to
implement correct algorithms. These facilities are expressed
in very different ways in different languages. The goal is
to map the multitude of forms of expressions into a
common form for analysis. This form should be close
enough to the hardware level to allow analysis of the
effects of hardware on the execution characteristics. Just
as importantly, this form should encapsulate the meaning
of the original algorithm.

Global Variables and Communications

In this section, a number of high level language
mechanisms for providing shared memory and inter-
process communication are surveyed. Implementations
in the two MIMD models are discussed.

Local vs. Global Variables

Local variables are considered to be those variables
accessed by only one PE while global variables are
accessed by more than one PE. This does not necessarily
imply that a local or global variable is kept in a local or
global memory. When speaking of variables, local and
global refer to a logical association with processes.
Different languages support this in different ways.
Ada* assumes data is local to tasks. As a consequence
of the visibility rules, an object declared in a parent task
is visible by the children, thus more than one task can
access a variable. In some situations, this is a
"dangerous" form of global variables since care must be
taken to avoid conflict by one task reading the variable
and another task changing it simultaneously. Ada
supports a library of implementation-dependent routines
which it calls the STANDARD package. To update these
globally visible variables in a reliable way, one must use
the SHARED_VARIABLE_UPDATE generic procedure
defined in the STANDARD package. This is a procedure
which insures data integrity in a multiprocessing
environment. To do this, it may have to perform some
sort of software interlocking. Alternatively, global
variables can be implemented as local variables in a
special task. This task does nothing except repeatedly
accept requests to access the data. That "data manager"
task must then accept the rendezvous by the processes
requesting the "global" data.

Similarly, Concurrent Pascal**''* uses monitors to
access global variables. Alternatively, Modula**’*^'*^
assumes all variables declared within the main program
block are global variables. All other variables are local.

These languages all explicitly provide for global
variables. The variables may be accessed in a limited
scope or through a special mechanism, but they are
readily available.

Interprocess Communication

There are some languages that do not provide for
global variables explicitly. An example of this is CSP**.
This language is a so-called "message based" language.
Omitting the global variable construct forces the
programmer to use other methods to transfer information
between processes. CSP requires all shared information
to pass between processes on clearly defined channels.
This performs a similar function to the traditional shared
variable. In fact, through the use of a semaphore and a
global variable, a similar operation could be performed in
a language that does not support message passing.

Global Variables vs. Communication

Global variables, by definition, contain information
relevant to more than one process. This may be in the
form of a global variable name, a monitor, or a
channel. Thus, inter-PE communication can be
interpreted as a form of global storage. Conversely,
global storage can be interpreted as a form of inter-PE
communication. Thus, global variables and
communications can be used for similar purposes and can
be implemented in the same way. In analyzing parallel
algorithms, this equivalence can be used to unify many
language constructs into a common analysis framework.
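The equivalence can be made concrete with a small sketch. Below, the same producer/consumer transfer is written twice in Python: once with a global cell guarded by semaphores, and once with a channel. Python threads stand in for PEs, and a one-slot `queue.Queue` stands in for a channel; it is a buffered approximation, not a true CSP rendezvous. The function names are illustrative assumptions.

```python
import queue
import threading


def _run(producer, consumer):
    """Run a producer and a consumer thread to completion."""
    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def via_shared_variable(values):
    """Transfer values through a global cell guarded by semaphores."""
    cell = {"value": None}
    full = threading.Semaphore(0)    # cell holds fresh data
    empty = threading.Semaphore(1)   # cell may be overwritten
    received = []

    def producer():
        for v in values:
            empty.acquire()
            cell["value"] = v
            full.release()

    def consumer():
        for _ in values:
            full.acquire()
            received.append(cell["value"])
            empty.release()

    _run(producer, consumer)
    return received


def via_channel(values):
    """Transfer the same values through a one-slot channel."""
    channel = queue.Queue(maxsize=1)
    received = []

    def producer():
        for v in values:
            channel.put(v)

    def consumer():
        for _ in values:
            received.append(channel.get())

    _run(producer, consumer)
    return received


if __name__ == "__main__":
    data = list(range(5))
    print(via_shared_variable(data), via_channel(data))
```

Both functions deliver the same sequence in the same order, so an analysis of the algorithm can treat either form as one generic transfer operation.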

Implementation Examples

To illustrate this further, consider implementing an
Ada compiler for a Shared Memory system. In order to
execute a SHARED_VARIABLE_UPDATE the code
must access memory set aside for general use. This area
of memory is designated as a global storage area and all
PEs access it whenever necessary. The hardware
accounts for the arbitration and insures that a memory
operation by one PE cannot be interrupted by another
PE. If this memory is accessed often by several
processes, the arbitration can introduce a significant
delay. Overall, this scheme is a quite direct
interpretation of the language construct.

Now consider implementing the same compiler for a
Private Memory system. There is no memory accessible
by all PEs. In this case the inter-PE communications
network plays a key role. Options include designating a
spare PE as a global memory handler, spreading the
global memory among the PEs randomly, or keeping the
global memory with the (parent) task where it is
declared. When a PE needs to access a memory location,
it must pass a short message to the appropriate PE
describing the memory operation. The remote PE may
return a value or perhaps a write confirmation. This is a
somewhat more complicated issue than merely using a
hardware arbitration scheme as in Shared Memory.
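The first option, a spare PE acting as a global memory handler, might look like the following sketch. A Python thread plays the handler PE and queues play the interconnection network; the message format `(op, address, value, reply_queue)` and the class name `RemoteMemory` are assumptions made for illustration, not part of any system described here.

```python
import queue
import threading


def memory_handler(requests):
    """Serve read/write messages against a private store until told to stop."""
    store = {}
    while True:
        op, address, value, reply = requests.get()
        if op == "write":
            store[address] = value
            reply.put("ack")                # write confirmation
        elif op == "read":
            reply.put(store.get(address))   # None if never written
        elif op == "stop":
            reply.put("stopped")
            return


class RemoteMemory:
    """Client-side view of the 'global' memory held by the handler."""

    def __init__(self):
        self._requests = queue.Queue()
        self._thread = threading.Thread(
            target=memory_handler, args=(self._requests,), daemon=True
        )
        self._thread.start()

    def _call(self, op, address=None, value=None):
        reply = queue.Queue(maxsize=1)
        self._requests.put((op, address, value, reply))
        return reply.get()                  # block until the handler answers

    def write(self, address, value):
        return self._call("write", address, value)

    def read(self, address):
        return self._call("read", address)

    def stop(self):
        self._call("stop")
        self._thread.join()


if __name__ == "__main__":
    memory = RemoteMemory()
    print(memory.write(7, 42), memory.read(7))
    memory.stop()
```

Every access costs one request message and one reply, and all requests are serialized through the handler, which is where the extra complexity and potential delay relative to hardware arbitration come from.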

Conversely,  suppose  the  task  at  hand  is  to  develop  a 
CSP  compiler  for  a  Private  Memory  machine.  CSP 
provides  simple  mechanisms  for  the  passing  of 
information  on  defined  channels.  There  is  a  natural 
mapping  from  the  use  of  channel  specifications  to  the  use 
of  an  interconnection  network. 

On the other hand, to provide for channels on a
Shared Memory machine, one would have to set aside
areas of memory to simulate the hardware channels. For
an algorithm with heavy inter-PE communications, it is
important that the compiler place these memory areas in
a fashion that produces few memory access conflicts. The
memory conflict problem is an area of research in itself^'^,
so it is sufficient to note here that this could be a
significant problem.

Language vs. Algorithm

In the case of “conventional” languages, it appears
that the “natural” mapping of global variables is to a
Shared Memory machine. Likewise, CSP and other
“message based” parallel languages “naturally” map to a
Private Memory machine. It is proposed that this
so-called “natural” mapping is actually artificial. In analysis
of parallel algorithms, we wish to analyze the algorithm,
not the language. In particular, the language should not
introduce a bias in favor of one of the two MIMD models.

In order to accomplish this analysis, an algorithm
must be stripped of its language dependencies. This can
be accomplished with a generic set of MIMD operations.
Every language construct must map to an equivalent
MIMD operation or set of MIMD operations. These
operations will identify a global memory
access/communications operation. From this common
intermediate form, a cost model for performing the
required access on a particular architecture can then be
applied. By mapping the high level language constructs
to this intermediate representation, we can migrate the
analysis away from language dependencies and towards
the relationship between the algorithm and the target
architecture.

Concurrency  Control 

Two aspects of concurrency control are the
specification of when processes can proceed concurrently
and the converse operation of preventing (presumably
harmful) simultaneous access to shared resources. In
relating these concepts to the performance of an
algorithm on an MIMD architecture, it is profitable to
focus on the fundamental mechanisms by which
concurrency is regulated. In the next section, the use of
the semaphore as a viable concurrency primitive for use
in algorithm analysis is outlined from two points of view:
(1) Semaphores can be used to express the
synchronization/concurrency indicated by higher level
language constructs. Thus, independent of language,
algorithms can be mapped to a representation which uses
semaphores. (2) Semaphores can be implemented on both
Shared Memory and Private Memory machines. Thus
the algorithm can be evaluated with respect to different
architectures.


Semaphores 

Concurrency specification is another facility with
many forms of expression. A partial list of mechanisms
in use includes semaphores, test and set, guarded
commands, replace-add, fetch and add, fetch and φ,
path expressions, interface modules, fork/join, cobegin,
monitors, rendezvous, event counts and sequencers,
channels, and messages.

Obviously, there are many ways to express
synchronization in algorithms. To allow analysis, the
goal is to identify a common form which can describe any
of these synchronization mechanisms. This form should
be close to hardware implementations so that further
algorithm analysis can correlate the algorithm with the
architecture.

Among the oldest synchronization methods are
Dijkstra's P and V operators. These provide a basis for
more modern proposals for synchronization mechanisms.
Briefly, a semaphore is an integer valued variable which
can have P and V operations applied to it. The V(S)
operation increments the semaphore S in an indivisible
fashion. The P(S) operation decrements the semaphore S
when the result would be non-negative. The test and
subsequent decrement form an indivisible operation.
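
The P and V behavior just described can be sketched in a few lines of Python. This is only an illustration of the definition, not an implementation taken from the report; the class name CountingSemaphore and the use of a condition variable to make a waiting process sleep are assumptions of the sketch.

```python
import threading


class CountingSemaphore:
    """Integer-valued semaphore with Dijkstra-style P and V (illustrative only)."""

    def __init__(self, initial=0):
        self._value = initial
        # The condition's lock makes each operation indivisible.
        self._cond = threading.Condition()

    def V(self):
        # Increment the semaphore indivisibly and wake one waiter, if any.
        with self._cond:
            self._value += 1
            self._cond.notify()

    def P(self):
        # Decrement only when the result would be non-negative.
        # The test and the decrement happen under the same lock.
        with self._cond:
            while self._value - 1 < 0:
                self._cond.wait()
            self._value -= 1

    def value(self):
        with self._cond:
            return self._value
```

A spinning variant would replace the wait() call with a loop that repeatedly re-tests the value; the definition permits either.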

The classical use for semaphores is in regulating
access to shared resources. However, the semaphore can
also act as a low-level, “common denominator” notation
for specifying concurrency. For instance, Edison
provides a cobegin statement in which a parent process
creates any number of processes, then waits for their
completion. The cobegin statement can easily be
expressed using Dijkstra’s notation. Suppose three
sub-processes are all concurrently executing, or ready to
execute given any scheduling constraints. Also suppose
that the main process is running. This would require two
semaphores, start and end, both initialized to 0. Fig. 1
shows the code for the main process and one of the
sub-processes.

Path Pascal provides another good example of a
language that can be translated meaningfully to P and V
notation. It has a very complicated syntax to describe
the concurrency of the program. This involves path
expressions which specify when processes can be invoked
in relation to other processes. For instance the path
expression

path proc1 ; proc2 end;

signifies that proc2 should run only after proc1 has
completed. Any number of these sequences may be
active at one time. Likewise the path expression

path 3:(beginproc; endproc) end;

signifies that endproc may only follow beginproc, and
there may be up to three concurrent executions of this
sequence. Concurrency limitations even as complex as
Path Pascal’s can be described with semaphores. The
condition that proc2 must follow proc1 is ensured by a
semaphore S with initial value of 0, such that proc1
executes a V(S) at its end, and proc2 executes a P(S)
before it begins. Likewise, to limit the number of
currently active paths to N, a semaphore S is initialized
to N. A P(S) is placed at the beginning of the path, and
a V(S) is placed at the end.

process main
    Do initial processing;
    V(start);
    V(start);
    V(start);
    P(end);
    P(end);
    P(end);
    Do more processing;
end process

process proc1;
    P(start);
    Do useful work;
    V(end);
end process

Fig. 1. Edison Coroutine Example
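
The two translations described above (a cobegin expressed with start and end semaphores, and a path limited to N concurrent executions) can be tried out with Python's standard threading.Semaphore, whose acquire() and release() play the roles of P and V. This is a rough sketch under that assumption; the function names run_cobegin and run_limited_path and the logging scheme are hypothetical and exist only for illustration.

```python
import threading


def run_cobegin(n_children=3):
    """Parent releases n children, then waits for all of them (cf. Fig. 1)."""
    start = threading.Semaphore(0)   # initialized to 0, as in the text
    end = threading.Semaphore(0)
    log = []
    log_lock = threading.Lock()

    def child(i):
        start.acquire()              # P(start): wait for the parent's go-ahead
        with log_lock:
            log.append(f"child {i} work")
        end.release()                # V(end): report completion

    threads = [threading.Thread(target=child, args=(i,)) for i in range(n_children)]
    for t in threads:
        t.start()

    with log_lock:
        log.append("parent initial")
    for _ in range(n_children):
        start.release()              # V(start), once per child
    for _ in range(n_children):
        end.acquire()                # P(end), once per child
    with log_lock:
        log.append("parent more")
    for t in threads:
        t.join()
    return log


def run_limited_path(n_paths=6, limit=3):
    """At most `limit` paths are active at once: P at the start, V at the end."""
    slots = threading.Semaphore(limit)   # semaphore initialized to N
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def path():
        slots.acquire()                  # P(S) at the beginning of the path
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        with lock:
            state["active"] -= 1
        slots.release()                  # V(S) at the end of the path

    threads = [threading.Thread(target=path) for _ in range(n_paths)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

In run_cobegin the parent's initial entry is always logged first and its final entry last, whatever order the children run in; run_limited_path returns the largest number of paths that were ever active together, which cannot exceed the limit.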

It is important to note that the definition of a
semaphore describes a behavior, not an implementation.
As such, it is an appropriate construct for describing the
concurrency-related characteristics of an algorithm. As
discussed below, it is model-independent in the sense that
there are implementations suitable for both Shared
Memory and Private Memory machines, so an algorithm
whose synchronization and concurrency requirements are
expressed in terms of semaphores can be mapped to
either model.

In the subsequent step of evaluating an algorithm
with respect to a particular architecture, the
implementation will be of interest. Commonly, the
semaphore notation is associated with a variable in a
global memory system. This follows the definition closely
for the V operation. The implementation of the P
operation is highly dependent upon the machine
architecture and even the operating system. The
definition does not specify what a process is to do while it
is waiting on the semaphore. Depending on the
circumstances, the process may loop, continually testing
the semaphore. Alternatively, the process may “sleep”
and allow another process to use the same CPU. In this
case, the V operation is responsible for “waking up” the
“sleeping” processes. The sleeping and waking is often
accomplished by intervention by the operating system.
The second implementation is prevalent in single CPU
time-sharing systems. These implementations seem
suited for a Shared Memory machine. The major
problem is guaranteeing the mutual exclusion during the
indivisible operations.

Likewise, there are implementations of P and V best
suited for a Private Memory machine. In one
implementation, a single PE would store the semaphore
in its private memory, and keep a queue of P requests,
then respond to the Ps whenever a V is performed on the
same semaphore. The example in Fig. 2 uses this
method. A single PE has exclusive access to each
semaphore and any other PE must communicate with
that one PE to gain access to the semaphore. This
guarantees the mutual exclusion needed for the
semaphores. This implementation could take good
advantage of a separate control unit (CU) or PE to
handle these actions. In Fig. 2, note that only the
procedure “handle_events” can actually modify the
semaphore. A simplified implementation for P and V is
also shown in Fig. 2.

Here  the  CU  has  to  reply  to  every  P  operation.  The 
requesting  PE  waits  until  it  receives  the  reply.  No  special 
restrictions  need  be  placed  on  the  accessing  of  the 
semaphore,  since  all  the  accesses  are  done  by  the  single 
CU.  If  the  value  of  the  semaphore  is  greater  than  0,  a  P 
operation  is  immediately  acknowledged  with  a  reply. 
When  the  PE  receives  the  reply,  it  continues  its 
execution.  When  the  semaphore  is  less  than  or  equal  to 
zero,  the  CU  keeps  track  of  Ps  by  keeping  a  queue  for 
each semaphore containing the PEs that have performed
a P on that semaphore and are waiting for a V from
another PE. It puts the PE into a queue associated with
the semaphore S through the routine enqueue(S.queue,
PE). Likewise, it removes a PE from the queue
associated with S through the routine dequeue(S.queue,
PE). The queue length must be as large as the number of
processes.

procedure P(semaphore)
    send_message_to_CU(PE, P_MESSAGE, semaphore);
    wait_for_reply_from_CU(PE, semaphore);
end

procedure V(semaphore)
    send_message_to_CU(PE, V_MESSAGE, semaphore);
end

procedure handle_events
    read_message_from_net(PE, message, S);
    case message in
        P_MESSAGE:
            if S.semaphore > 0 then
                send_reply_to_PE(PE, S);
                S.semaphore = S.semaphore - 1;
            else
                enqueue(S.queue, PE);
            end if
        V_MESSAGE:
            S.semaphore = S.semaphore + 1;
            if S.semaphore > 0 and
               NOT_EMPTY(S.queue) then
                dequeue(S.queue, PE);
                send_reply_to_PE(PE, S);
                S.semaphore = S.semaphore - 1;
            end if
    end case
end procedure

Fig. 2. P and V Implementations
in a Private Memory System
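
The control-unit scheme of Fig. 2 can be mimicked with a small single-threaded simulation, which makes the queueing behavior easy to inspect. This is a sketch only: the class name ControlUnit, the string message tags, and the replies list (standing in for messages sent back over the interconnection network) are assumptions introduced for illustration.

```python
from collections import deque


class ControlUnit:
    """Owns every semaphore; PEs reach it only through messages (cf. Fig. 2)."""

    def __init__(self):
        self.value = {}     # semaphore name -> integer value
        self.queue = {}     # semaphore name -> PEs blocked in a P
        self.replies = []   # (pe, semaphore) replies sent, in order

    def create(self, name, initial=0):
        self.value[name] = initial
        self.queue[name] = deque()

    def handle_event(self, pe, message, name):
        if message == "P":
            if self.value[name] > 0:
                self.value[name] -= 1
                self.replies.append((pe, name))    # requesting PE may continue
            else:
                self.queue[name].append(pe)        # requesting PE stays blocked
        elif message == "V":
            self.value[name] += 1
            if self.value[name] > 0 and self.queue[name]:
                waiter = self.queue[name].popleft()
                self.value[name] -= 1
                self.replies.append((waiter, name))
        else:
            raise ValueError(f"unknown message {message!r}")
```

Because only handle_event touches the semaphore values, no locking is needed inside the control unit; the serialization of incoming messages provides the mutual exclusion.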

For both Shared Memory and Private Memory
implementations, semaphore access can become a
bottleneck. For specific algorithm/implementation
environments, simulation can be used to assess this; for
the more general case, statistical and queuing analyses
can be applied. Techniques to avoid the bottleneck
involve distributing the load. An example of this in a
Shared Memory machine is the use of Fetch and Add
hardware in the NYU Ultracomputer. In a Private
Memory machine, various PEs (rather than a single CU)
can be responsible for semaphores. To allow a process to
know which PE controls a given semaphore, the compiler
could associate a simple tag with each semaphore. The
distribution of the semaphores across the PEs would
reduce the likelihood of bottlenecks, and the tagging
would not add a significant cost to the implementation.

Extensions to Semaphores

In  the  previous  section,  the  feasibility  of  using 
semaphores  as  a  primitive  for  algorithm  analysis  was 
discussed,  in  terms  of  the  mappings  from  both  high  level 
language  to  semaphore  representation  and  from 
semaphore  representation  to  architecture.  With  some 
minor  extensions  to  P  and  V,  more  general  mechanisms 
can  be  provided  that  more  closely  correspond  to  modern 
day  architectures.  In  the  Edison  coroutine  example,  the 
parent  process  executes  three  V  operations  and  three  P 
operations.  These  operations  could  be  done  just  as  well 
with  slightly  expanded  P  and  V  operations  called  Pn  and 
Vn.  The  Vn(S,  N)  operation  adds  N  to  the  semaphore  S 
in an indivisible fashion. The Pn(S, N) operation
subtracts N from the semaphore S when the result would
be non-negative. The test and subsequent
subtraction form an indivisible operation. The cobegin
statement in the Edison main process would simply be
implemented  as  Vn(start,  N)  followed  by  Pn(end,  N), 
where  N  is  the  number  of  parallel  branches  of  the 
algorithm. 
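
A minimal sketch of Pn and Vn, and of the cobegin translation just described, might look as follows in Python. It assumes a condition variable is an acceptable stand-in for the indivisible test-and-subtract; the names NSemaphore and cobegin are hypothetical.

```python
import threading


class NSemaphore:
    """Semaphore supporting Pn(S, N) and Vn(S, N) (illustrative only)."""

    def __init__(self, initial=0):
        self._value = initial
        self._cond = threading.Condition()

    def Vn(self, n=1):
        # Add n indivisibly; several waiters may now be able to proceed.
        with self._cond:
            self._value += n
            self._cond.notify_all()

    def Pn(self, n=1):
        # Subtract n only when the result would be non-negative.
        with self._cond:
            while self._value - n < 0:
                self._cond.wait()
            self._value -= n


def cobegin(branches):
    """Run the callables in `branches` concurrently; return their results."""
    n = len(branches)
    start, end = NSemaphore(0), NSemaphore(0)
    results = [None] * n

    def child(i):
        start.Pn(1)                  # P(start)
        results[i] = branches[i]()
        end.Vn(1)                    # V(end)

    threads = [threading.Thread(target=child, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    start.Vn(n)                      # replaces n separate V(start) operations
    end.Pn(n)                        # replaces n separate P(end) operations
    for t in threads:
        t.join()
    return results
```

The parent issues one Vn and one Pn regardless of the number of branches, which is the directness the text refers to.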

Pn and Vn can be used in this context as a superset
of P and V since P(S) = Pn(S, 1) and V(S) = Vn(S, 1). Pn
and Vn retain the meaning taken from the higher level
constructs. In the Edison example, the Pn and Vn
implementation is more direct than the P and V
implementation. There is a trend to put higher level
synchronization facilities in parallel languages and
architectures. A set of notations which includes P and V,
but which also includes more complex synchronization
mechanisms may therefore be useful in analysis. A higher
level notation must, however, be able to map directly and
equivalently to the low level notation (P and V). Only
then would it be applicable to an architecture which does
not support the higher level capability.

In order to show that this extension to a larger set of
primitives is valid, possible implementations of Pn and
Vn are given. Pn and Vn can be implemented using only
P and V as shown in Fig. 3. The semaphore S becomes a
variable containing two parts. S.semaphore is a
semaphore valued variable which contains the value
associated with S. S.simple_semaphore is a semaphore
which can have only P and V operations performed on it.
S.simple_semaphore is initialized to 1. Then accessing
S.semaphore is bracketed by P and V operations on
S.simple_semaphore. This ensures that only one process
may access S.semaphore at any time.

One must take care in blindly converting groups of
P operations into a single Pn operation. A problem
occurs if two or more processes have outstanding Pn
operations on a semaphore with different values of N.
The process with the largest value of N may be locked
out by the other processes, since they can continue when
the value of the semaphore reaches some value smaller
than the largest value of N. Pn could be defined
differently to account for this condition. In translating
the joining of concurrent paths into a Pn operation, only
one process may execute the Pn, so for this analysis this
is not a problem.

Since many languages and architectures provide
mechanisms for higher level synchronization constructs, it
is desirable to use a higher level mechanism where
possible. This higher level notation must satisfy the two
following conditions: (1) P and V can be simply defined in
terms of the mechanism. (2) The mechanism can be
simply implemented in terms of P and V.

First, in any analysis, it will be necessary to
recognize the P and V operation in its basic form so it
can be analyzed in a consistent manner. Secondly, for
those machines not supporting the provided constructs
directly, the defined operations should be easily
implemented using only common machine instructions
and P and V operations.

procedure Pn(S, N)
    loop forever
        if S.semaphore - N >= 0 then begin
            P(S.simple_semaphore);
            temp = S.semaphore - N;
            if temp >= 0 then
                S.semaphore = temp;
            V(S.simple_semaphore);
            if temp >= 0 then
                RETURN;
        end if
    end loop
end procedure

procedure Vn(S, N)
    P(S.simple_semaphore);
    S.semaphore = S.semaphore + N;
    V(S.simple_semaphore);
end procedure

Fig. 3. Pn and Vn as Defined in Terms of P and V

A Generalized Semaphore Notation

It has been shown how the P and V operations can
be extended to the more general Pn and Vn operations,
based on the stated restrictions. There are numerous
ways to extend the P and V operations to various forms.
In developing the NYU Ultracomputer, a generalized
notation was developed to describe Fetch and Add and
other similar synchronization constructs. This notation
will be borrowed, then extended. The extension shows
how many general semaphore mechanisms can be defined
in terms of two functions.

The Fetch and Add operation is defined as shown in
Fig. 4. The part of the operation between the braces is
considered indivisible.

FetchAndAdd(G, L)
    { Temp ← G
      G ← G + L }
    RETURN Temp;

Fig. 4. FetchAndAdd Definition

This is equivalent to the Vn operation described earlier. Pn
can be defined in terms of Fetch and Add as shown in
Fig. 5.

procedure Pn(S, N)
    loop forever
        if S - N >= 0 then begin
            temp ← FetchAndAdd(S, -N);
            if temp >= N then
                RETURN;
            else
                FetchAndAdd(S, N);
        end if
    end loop
end procedure

Fig. 5. Pn in Terms of FetchAndAdd

There are many possible hardware facilities available
to support similar operations. Rather than picking one
facility as a basis for all synchronization mechanisms, a
class of mechanisms is defined based on the two
restrictions previously given. The goal is to define
operations which correspond to P and V, but which can
account directly for a wider variety of high level language
constructs and hardware implementations.

The first operation corresponds to V, and is the
FetchAndφ operation proposed in Gottlieb and Kruskal.
It is defined as shown in Fig. 6.

FetchAndφ(G, L)
    { Temp ← G
      G ← φ(G, L) }
    RETURN Temp;

Fig. 6. FetchAndφ Definition


φ is a suitably defined function, G is a variable, and L is
a value. Only a few such functions have proven themselves
useful in synchronization statements. A few common φs are
given in Table 1. A * in the expression indicates that the
value is not used and need not be returned.

Table 1. Common Uses of FetchAndφ

Operation                Expression in FetchAndφ
R ← TestAndSet(G)        R ← FetchAndOr(G, TRUE)
R ← FetchAndAdd(G, L)    R ← FetchAndAdd(G, L)
V(S)                     * ← FetchAndAdd(S, 1)
Vn(S, N)                 * ← FetchAndAdd(S, N)
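
The FetchAndφ family and the busy-waiting Pn of Fig. 5 can be exercised with a short Python sketch. A software lock stands in here for the hardware that would make the bracketed part of the operation indivisible; that substitution, and the names SharedCell, fetch_and_phi, and fetch_and_add, are assumptions made for illustration.

```python
import threading


class SharedCell:
    """A shared integer with an indivisible fetch-and-phi (lock emulates hardware)."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def fetch_and_phi(self, phi, operand):
        with self._lock:                 # the part between the braces
            temp = self.value
            self.value = phi(temp, operand)
        return temp                      # old value goes back to the caller

    def fetch_and_add(self, operand):
        return self.fetch_and_phi(lambda g, l: g + l, operand)


def Vn(s, n):
    s.fetch_and_add(n)                   # result not needed (the * in Table 1)


def Pn(s, n):
    """Busy-waiting Pn in the style of Fig. 5."""
    while True:
        if s.value - n >= 0:             # optimistic peek, not indivisible
            old = s.fetch_and_add(-n)
            if old >= n:
                return                   # the subtraction left S non-negative
            s.fetch_and_add(n)           # another process got there first: undo
```

The peek before the subtraction only reduces wasted traffic; correctness rests on checking the value returned by the fetch and add and undoing the subtraction when it was too small.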

Logically, this notation can be extended to describe
a class of operations that corresponds to the P operation.
Thus, the notation is expanded here to include WaitForφ.
This has two equivalent definitions/implementations
based on the capabilities of the hardware. These are
shown in Fig. 7. As before, G is a variable, L a constant.
The new argument C is a condition that must be satisfied
before the function will return. The first implementation
assumes the hardware is capable of performing a test and
conditionally performing the equivalent of FetchAndφ.
The second insists only that the hardware be capable of
the FetchAndφ operation. Note that the second form
may perform unnecessary steps. It decides that the
operation should probably succeed and then, in another
distinct step, it attempts the φ operation. In case its
assumption was invalidated, it checks afterwards and
“undoes” the φ function with φ⁻¹.


A generalized notation for basic synchronization
mechanisms has been presented. The corresponding
examples in Tables 1 and 2 show four pairs of
mechanisms that can be described in terms of the
generalized notation, including P and V themselves. This
model unifies several of these types of mechanisms into a
common notation. These mechanisms may be realizable
in hardware for some architectures. They can safely be
used in analysis, since they can also be realized using only
P and V, the basic semaphore operations. It is desirable
to have a number of these mechanisms available for
analysis since software specifications of these mechanisms
can be mapped to the most natural hardware
implementation, rather than to a less obvious and more
artificial implementation.

A Simple Set of Language-Independent MIMD Operations

An MIMD algorithm will consist of portions that
execute simply as serial code on a single PE along with
several operations specific to MIMD operation. The
principal such operations have just been described.
Throughout the discussion, it has been argued that the
common language operations map into a few simple
generic operations. The operations map as given in Table
3. Attempts to map features into more complex
operations result in counter-examples of constructs from
some languages that will not map into the low level
mechanism.


WaitForφ(G, L, C(G))
    loop
        { if C(G) then
              Temp ← G;
              G ← φ(G, L); }
        if C(Temp) then
            RETURN Temp;
    end loop

WaitForφ(G, L, C(G), φ⁻¹(G, L))
    loop
        if C(G) then begin
            Temp ← FetchAndφ(G, L)
            if C(Temp) then
                RETURN Temp;
            else
                FetchAndφ⁻¹(G, L)
        end if
    end loop

Fig. 7. WaitForφ Definitions


Table 2 shows some examples of synchronization
operations that can be expressed in terms of WaitForφ
functions. A binary semaphore can have values 0 or 1
(free or reserved). The Wait on a Binary Semaphore
operation waits until the semaphore is free (0) and
reserves it (sets it to 1) in one indivisible operation. The
other constructs in the table are familiar. In all cases,
the choice of φ, C, and φ⁻¹ functions must be consistent
and must guarantee reliable operation.
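
As a rough illustration of the first WaitForφ form, in which the test and the conditional update are a single indivisible step, the following Python sketch builds three of the operations from Table 2 on one generic routine. The lock stands in for the hardware guarantee, and the names Cell, wait_for_phi, and the wrapper functions are hypothetical.

```python
import threading


class Cell:
    """Shared variable G; the lock emulates an indivisible hardware step."""

    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()


def wait_for_phi(cell, operand, cond, phi):
    """Spin until cond holds for G, then apply phi and return the old value."""
    while True:
        with cell.lock:                          # test and update together
            if cond(cell.value):
                temp = cell.value
                cell.value = phi(temp, operand)
                return temp


def wait_on_binary_semaphore(cell):
    # phi = OR, L = 1, C(X): X = 0
    return wait_for_phi(cell, 1, lambda x: x == 0, lambda g, l: g | l)


def P(cell):
    # phi = Subtract, L = 1, C(X): X >= 1
    return wait_for_phi(cell, 1, lambda x: x >= 1, lambda g, l: g - l)


def Pn(cell, n):
    # phi = Subtract, L = N, C(X): X >= N
    return wait_for_phi(cell, n, lambda x: x >= n, lambda g, l: g - l)
```

Because the condition is checked under the same lock as the update, this form never needs the undo step that the second form of Fig. 7 performs with φ⁻¹.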


Table 3. Mapping to Generic Operations

High level operation           Low level operation
Arithmetic/logical             Same as serial analysis
Conditional Branching          Same as serial analysis
Global Variable Access         Communications/Memory
Interprocess Communications    Communications/Memory
Concurrency Control            P/V or
  and Synchronization            FetchAndφ/WaitForφ

Traditional analysis techniques exist for enumerating
the time/space costs of simple arithmetic statements and
loop control statements. The most significant problems
with MIMD algorithm analysis stem from global
variables/communications (GV/C), concurrency, and
synchronization. Because of the generality of the GV/C
and concurrency/synchronization primitives shown here,
it is possible to map the high level constructs from a wide
variety of languages into combinations of these
primitives. Thus, in analysis, the occurrences of any given
high level construct would map to a series of these
generic operations.

Example

A simple example of an analysis using the proposed
primitives is presented. Although the example is very
simple, it illustrates the mapping of the high level
language algorithm to a set of primitives. The example is

Table 2. Common Implementations Using WaitForφ

Operation              result  φ         G  L  C(X)   φ⁻¹
Wait on Bin. Sem.      *       OR        G  1  X = 0  G
P(S)                   *       Subtract  S  1  X ≥ 1  G + 1
Pn(S, N)               *       Subtract  S  N  X ≥ N  G + N
R ← WaitAndSub(G, L)   R       Subtract  G  L  X ≥ L  G + L

Abstract

An approach is proposed for modeling off the shelf
hardware and for modeling parallel algorithms, along with a
design methodology to use the information provided by these
models to design a class of macro-pipelined special purpose
architectures. Nine parameters to form a model of the
characteristics of parallel/distributed algorithms and the environment
in which they must execute are presented. In addition, a set of
tuples to model the characteristics of computer architectures is
presented. By combining the tuples with the parameters, the
execution time of the algorithm modeled by the parameters on
the hardware modeled by the tuples can be approximated. The
combination of these models could be used as a basis for
computer aided tools used in the design of macro-pipelined
parallel/distributed processors.


1.  Introduction 

For certain applications, such as speech processing, time is
an important factor. In such applications, there is a need to
process many data sets in the same way, e.g., continually performing
an FFT for every frame of input data. Previous analyses, such as
those performed in [4], [5], [34], [35], and [37], indicate that for
many types of tasks, conventional general purpose processors
are insufficient. In this paper, an approach is proposed for
modeling off the shelf hardware and for modeling parallel
algorithms, along with a design methodology to use the information
provided by these models to design a class of macro-pipelined
special purpose parallel architectures. The ultimate goal is to
use models such as the ones proposed here to develop computer
aided design tools. Special purpose processing systems (such as
those used for dedicated real-time analysis) are typically sold in
small quantities. As a result, the cost of the design can make the
resulting system prohibitively expensive. Computer aided design
tools for this process would reduce the cost involved and are
therefore desirable.

This paper uses nine parameters to correlate the hardware
to be designed with the applications software to be executed and
the I/O environment in which the machine will operate. A
macro-pipelined layered approach to task decomposition is
demonstrated. Each portion of the decomposed task is then
assigned to a special purpose processing unit. This implies that
each processing unit may either be a traditional serial type
design or a parallel design. Once this initial decomposition is
established, techniques such as those used to adjust the
execution time and throughput of a pipeline in [14] can be applied.

In this approach to reaching the goal of computer aided
computer design, functional descriptions (models) of the
hardware components that may be used in the design must be
combined into a database. Included in such descriptions are
information about the cost of the device, an enumeration of all
the operations that it can perform, and the pathwidth and
execution times for those operations. More complex taxonomies,
such as those found in |•], |\l|, |U)|, and ||J|, are not needed for
the database because they specify architectural information that
is unneeded here.

The information in the database will be used to select
hardware to perform a given task within cost and time
constraints. Time constraints are in two forms, response time and
throughput. Response time is the time between receiving an
input and completion of the corresponding result. Throughput is
the number of data sets processed per unit time.

Consider a task that is composed of several sub-tasks. An
example of such a task might be isolated word recognition [17],
[19], [23], and [37]. For isolated word recognition, a typical
processing scenario might be: digital filtering, autocorrelation
analysis, linear predictive coding (LPC) analysis, linear time
warping, and dynamic time warping. Each of these processes
(sub-tasks) represents a portion of the scenario. Each of these
sub-tasks will be called a layer. Using information about each
sub-task, a special-purpose architecture can be developed to
perform the sub-task within some time and cost constraints. The
special-purpose hardware that is assigned to each layer will be
called a level.

For simplicity, only scenarios in which there is no feedback
will be considered. Initially, the layers will be chosen according
to conceptual differences, i.e., digital filtering is different from
autocorrelation analysis, so each should be a different layer.
This uses the simplifying assumption that conceptually different
portions of the task (the layers) will require different hardware
resources to produce an initial configuration. The layers and
their associated levels of an isolated word recognition system are
shown in Fig. 1.1.

It is the goal of this scheme to achieve a higher throughput
by decomposing a scenario into layers. Because each layer
requires fewer computations than the entire scenario, connecting
the levels in a macro-pipeline and pipelining the data sets
through the machine should increase the throughput of the
resulting system. This type of parallelism is referred to here as
vertical parallelism. Furthermore, each layer is executing on
specially designed hardware, which may employ multiple
computational units, so the response time of the resulting system is
decreased. The parallelism occurring within a given level, where
multiple units are performing operations on different portions of
the data set simultaneously, is referred to as horizontal
parallelism. This vertical and horizontal parallelism is similar to the
techniques of subdivision and replication discussed for pipelines
in [14] or the “purely pipelined” and the “purely parallel”
architectures discussed in [30]. Throughput constraints may require
that a layer must be further divided into smaller processes.
These will not represent new layers, but sub-layers, which will
correspond to sub-levels of hardware, consistent with the
previous nomenclature.
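
The effect of vertical and horizontal parallelism on the two time constraints can be illustrated with a deliberately idealized calculation. The sketch below assumes that each level handles one data set at a time, that moving data between levels costs nothing, and that splitting a level across several units gives a perfect speedup; none of these assumptions comes from the paper, and the function names and the per-layer times are hypothetical.

```python
def pipeline_metrics(level_times):
    """Return (response_time, throughput) for an idealized macro-pipeline."""
    response_time = sum(level_times)        # one data set visits every level
    throughput = 1.0 / max(level_times)     # the slowest level sets the rate
    return response_time, throughput


def split_level_horizontally(level_times, index, units):
    """Divide one level's work evenly across `units` computational units."""
    times = list(level_times)
    times[index] = times[index] / units
    return times


# Hypothetical per-layer times (arbitrary units) for a four-layer scenario.
layers = [4.0, 2.0, 8.0, 2.0]
serial_throughput = 1.0 / sum(layers)                   # one processor does it all
response, throughput = pipeline_metrics(layers)         # vertical parallelism only
balanced = split_level_horizontally(layers, 2, 4)       # add horizontal parallelism
response_h, throughput_h = pipeline_metrics(balanced)
```

Under these assumptions, pipelining alone raises the throughput without changing the response time, while adding units to the slowest level improves both, which matches the qualitative argument in the text.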

By developing a model to transform a task description into
a potential macro-pipelined architecture, a machine can be built
with the necessary characteristics to execute the task quickly
and without excessive amounts of hardware. A basis for such a
model is proposed and analyzed in this paper. The information
provided by the nine parameters mentioned earlier will allow
each level to be designed for a specific sub-task, having a special
hardware complement to perform that sub-task more quickly.
Each level can use SIMD and/or MIMD |0j parallelism. The
result of the technique is to design a machine that can perform a
processing scenario within some time constraints.

Fig. 1.1 Layering of Isolated Word Recognition System

It is the goal of this paper to introduce methods of modeling
hardware and algorithms so that a reasonable approximation
of the execution time of an algorithm on a special-purpose
system is possible. The hardware model is discussed in Section 2.
An overview of the proposed design scenario is presented in
Section 3. In addition, types and limitations of various forms of
parallelism are discussed. Section 4 presents nine parameters
that model an algorithm and discusses the calculation and
significance of each of the parameters. An example of the design
methodology is given in Section 5.

2. The Hardware Database

A processor description in the database consists of two
6-tuples, two N-tuples, and three (N+1)-tuples, where N is the
number of assembly language instructions (the “+1” is used to
describe the instruction fetch unit). The first 6-tuple consists of
the processor name, cost, clock speed, data pathwidth, address
pathwidth, and virtual address width. The second 6-tuple consists
of the size and speed of on-board cache, the size and speed
of on-board memory, and the size and width of the registers.
The N- and (N+1)-tuples must be able to answer questions
regarding the execution time for all processors in the database.
Thus, the tuples must provide information about the type of
machine instructions, the execution time for a single operation
for each instruction, the number of stages in any pipelines, the
replication of units, and the overlap of operations. The tuples
corresponding to the last three information categories are
(N+1)-tuples to account for any pipelining, functional overlap,
and parallelism that can occur within the instruction fetch unit.
By combining the information contained in the various tuples, it
is possible to determine the exact calculation time of all
operations whose times are constant. For example, by combining the
number of stages in a pipelined unit with the single operation
execution time of the unit, it is possible to determine the
throughput of the unit.
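
The tuple layout described above can be sketched as a small record type. This is only an illustration: the class name, field names, and the throughput formula (which assumes equal-length stages and a pipeline that is kept full) are assumptions made here, not part of the database proposed in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ProcessorDescription:
    """Hypothetical container for one processor entry in the database."""
    general: Tuple      # first 6-tuple: name, cost, clock, data/address/virtual widths
    storage: Tuple      # second 6-tuple: cache, memory, and register size/speed/width
    instructions: List[str]     # N-tuple: type of each machine instruction
    op_times: List[float]       # N-tuple: single-operation time per instruction
    pipeline_stages: List[int]  # (N+1)-tuple: last entry is the fetch unit
    replication: List[int]      # (N+1)-tuple: number of replicated units
    overlap: List[bool]         # (N+1)-tuple: whether operations can overlap

    def is_consistent(self) -> bool:
        """Check the tuple lengths against N, the number of instructions."""
        n = len(self.instructions)
        return (len(self.general) == 6 and len(self.storage) == 6
                and len(self.op_times) == n
                and all(len(t) == n + 1 for t in
                        (self.pipeline_stages, self.replication, self.overlap)))


def pipelined_throughput(op_time: float, stages: int) -> float:
    """Results per unit time, assuming equal stages and a full pipeline."""
    if op_time <= 0 or stages < 1:
        raise ValueError("op_time must be positive and stages at least 1")
    return stages / op_time
```

Under these assumptions, a four-stage unit whose single operation takes 2 time units delivers `pipelined_throughput(2.0, 4)` results per time unit, i.e. one result every half time unit.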

Because different processors have different instruction sets,
N is not the same for all processors. Consider the case of a
simple processor with an instruction set consisting of an 8-bit add, a
16-bit add, a return on zero, a move memory to register (8-bit),
and a move register to memory (8-bit). The first 6-tuple would
look like:

(BRAND/MODEL, $5.00, 1.3 µsec, 8-bits, 16-bits, 16-bits)

The tuple describing the type of machine instructions would
look like:

8-bit add register to register
16-bit add register to register
return if zero
8-bit move memory to register
8-bit move register to memory

For  this  tuple,  both  the  source  and  destination  must  be 
enumerated.  This  allows  for  processors  like  the  8085  in  which 
registers  can  only  be  added  to  the  accumulator. 

The other tuples contain the types of information mentioned
earlier, where information in the i-th element corresponds
to the i-th instruction. By including this information in the
database, it is possible to recreate the timing information stored in
the architecture description set forth in [12].

For the purposes of this paper, the units considered for the
database are either single chips or small boards. The underlying
assumption for this scheme is that there are no shared or
reconfigurable pipeline units on board. When this assumption
becomes false, two (N+1)-tuples will be required to represent
shared pipelines and their reconfiguration times. Other factors
that should be included in the database are power consumption,
heat dissipation, and size. While these last three factors do not
influence performance, they do provide necessary application
information about the possible environment in which the chips
can operate.

A functional description, such as that found in [2], can be
used to accurately categorize each unit according to its
functional capabilities. To this point, only processing hardware has
been considered. The hardware database can be divided into the
functional units of processor, memory, input/output, vector,
and array processors. This is consistent with [2]. Each
functional unit will have its own set of tuples used to describe its
performance. The tuples will be used with the characteristics of
the application algorithm to choose specific hardware for each
level of the system.

Included with the hardware descriptions of the processors
in the database would be a routine that can simulate that
processor. By combining the simulation procedures with the
architectural information of other components in the database, e.g.,
memories, it is possible to create a simulator for the proposed
macro-pipelined architecture. Such a database with simulation
routines for each relevant component would be a useful tool for
the research community interested in the design of
macro-pipelined special purpose systems. These tools would be used
according to the approach presented in the next section.

3.  The  Design  Scenario 

After the initial layering is performed, an exact statement
of the application algorithm to be performed at each level is
needed. This is done using nine parameters that are discussed in
the next section. This information is then used in conjunction
with the hardware description to evaluate the performance of
each processor in the hardware database. Then information
about the desired throughput and average desired response time
of the system must be gathered. These will be the
evaluation criteria, i.e., can a proposed system process the data
with the desired throughput and response time.

The first step in the modeling process is to choose all levels
to process their incoming data as fast as possible without using
vertical or horizontal parallelism within any given level. Since
this type of design is a macro-pipeline, the throughput of the
pipeline is limited by the slowest level.

Macro-pipelined architectures produce a continuous flow of
data. The time to process a single data set (the time for data to
go from the first level to the last level, i.e., the response time) in
such a vertical architecture is the same as for a single-processor
serial system because the data must be processed by the multiple
levels of hardware. The throughput for multiple data sets is
greatly increased because a new result is completed once every
processing period of the slowest level or sub-level. If
the time to go from the first level to the last level is too slow,
since the levels are designed with the fastest serial processors in
the database and only off the shelf parts are allowed, horizontal
parallelism, such as that found in SIMD or MIMD machines,
must be applied. For example, if the processing time for all
levels and sub-levels of an architecture were halved, the time to go
from the first level to the last level would also be halved. Thus,
vertical parallelism can be applied to increase throughput, while
horizontal parallelism can be applied to increase throughput and
decrease response time.

If the required throughput is 1 job per T_T seconds, then each
level must execute its layer in at most T_T seconds. If the
machine fails to meet the throughput qualification, the execution
speed of all levels not meeting the time constraint (T_T)
must be increased. This can be accomplished with the previously
discussed horizontal and vertical parallelism.
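
A minimal sketch of this first evaluation step follows. It assumes that each level is summarized by a single processing time per data set, that the response time is the plain sum of the level times (buffering and synchronization delays are ignored), and that the throughput bound is given as a period `t_required`; the function and key names are illustrative only.

```python
def pipeline_metrics(level_times, t_required):
    """Evaluate a macro-pipeline from the per-level processing times.

    level_times: time each level needs for one data set.
    t_required:  required period, i.e. one job every t_required time units.
    """
    if not level_times or min(level_times) <= 0:
        raise ValueError("need at least one positive level time")
    bottleneck = max(level_times)  # the slowest level limits the pipeline
    return {
        "throughput": 1.0 / bottleneck,     # jobs per unit time
        "response_time": sum(level_times),  # first level to last level
        # indices of levels that must be sped up to meet the constraint
        "too_slow": [i for i, t in enumerate(level_times) if t > t_required],
    }
```

For level times of 2, 4, and 1 with a required period of 3, only the second level (index 1) is reported as needing additional parallelism.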

The maximum amount of horizontal parallelism that can
be applied to a task is the inherent parallelism of the subtask to
be performed (the minimum horizontal parallelism at any level
is a single unit). Further, horizontal parallelism is affected by
precedence constraints of the subtask. Typically, each additional
processor used for horizontal parallelism will not increase
the execution speed linearly, i.e., the speedup may be less than a
factor of P using P processors for any P. This is discussed in [31].
The minimum vertical parallelism is one processor and the
maximum vertical parallelism is up to one processor per instruction.
Using one processor per instruction will not only cause a
potential architecture to be prohibitively expensive, it may require an
exorbitant amount of overhead to implement. Vertical parallelism
is not affected by precedence constraints because they are
still enforced; however, vertical parallelism will not reduce the
response time. Thus, there are associated costs and limitations
with both vertical and horizontal parallelism.

There are two additional limitations on the type and
amount of parallelism applied at each level. The first is that
there is an upper bound on the cost. An additional limitation is
placed on the type and amount of parallelism by requiring that
all parts be “off the shelf.” This second limitation forces the
architecture to be buildable with present day technology. These
limitations assume that an algorithm can be structured for
horizontal parallel execution. If an algorithm is unsuitable for
horizontal parallel execution, vertical parallelism will be required.

It is required that there be some form of coordination
between the levels. This can be either (a) a master system clock
that tells each level when it can proceed to the next data set or
(b) a unit that keeps track of all levels and, when all levels are
done, signals each to proceed to the next data set. A system
executing with a master clock will typically execute more slowly
than a system where each level reports its status to a control
unit. If T_i is the time required for level i to complete its subtask
given its current data set, then the master clock cycle time T_c
must be set to the maximum value of T_i over all levels for all
data sets. The implementation suggested in (b) for an L level
system will require an execution time T_s of:

T_s = max(T_1, T_2, T_3, ..., T_L).

There is additional overhead for
scheme (b) in terms of control hardware and signaling time.
Thus, if it is expected that there will not be a significant
difference between T_c and T_s, method (a) would be preferable.
In the extreme case, T_c = T_s. Normally, T_s will be much less
than T_c.
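
The two coordination schemes can be compared numerically with a short sketch. The layout of `times` (one row per pipeline step, one column per level) and the flat `signal_overhead` charged per step for scheme (b) are assumptions introduced here for illustration.

```python
def master_clock_cycle(times):
    """Scheme (a): one fixed cycle, the maximum time over all levels and steps."""
    return max(max(row) for row in times)


def signaled_cycles(times):
    """Scheme (b): each step lasts as long as its slowest level."""
    return [max(row) for row in times]


def compare_schemes(times, signal_overhead=0.0):
    """Return total time under (a) and under (b) for the same workload."""
    clocked_total = master_clock_cycle(times) * len(times)
    signaled_total = sum(t + signal_overhead for t in signaled_cycles(times))
    return clocked_total, signaled_total
```

With `times = [[1, 2, 3], [2, 2, 2], [5, 1, 1]]` and an overhead of 0.5 per step, the master clock needs 15 time units while the signaled scheme needs 11.5; when the rows have nearly equal maxima the two totals converge and the simpler clock becomes attractive.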

To fully utilize the hardware in the system, it is desirable
to match the speed of all the levels. This can be done in an L
level system by forcing the average response time of level i, T_i,
to be T_i = T_total/L. After the initial design (all levels designed to
perform their layer as fast as possible with no vertical or
horizontal parallelism), the data processing rate of all the levels will
be known. If the designed machine meets or exceeds the
throughput and response time qualifications of the scenario,
faster levels that are adjacent can be combined. Faster levels can
also be built with slower and less expensive hardware. This will
still maintain the throughput of the system; however, the
response time of the system may be increased. Such a process
can be repeated as long as the throughput/response time
requirements are met. This will lower the cost of the overall
system.
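
One way to picture the combination of adjacent fast levels is a greedy left-to-right pass. This is a sketch under two assumptions that are not stated in the paper: a combined level takes the sum of the times of the levels it absorbs, and a single pass is acceptable even though it is not guaranteed to find the cheapest grouping.

```python
def combine_adjacent_levels(level_times, t_required):
    """Greedily merge neighbouring levels while staying within t_required.

    Returns a list of (level_indices, combined_time) groups.
    """
    groups = []
    for i, t in enumerate(level_times):
        if groups and groups[-1][1] + t <= t_required:
            indices, total = groups[-1]
            groups[-1] = (indices + [i], total + t)  # absorb into previous group
        else:
            groups.append(([i], t))                  # start a new hardware level
    return groups
```

For level times `[1.0, 1.0, 3.0, 0.5, 2.0]` and a required period of 3.0, five levels collapse into three groups, so two sets of hardware are saved without lowering the throughput.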

To propose and evaluate candidate architectures for levels,
a mapping is required between a layer and its corresponding
level. Included in this mapping is the description of the layer in
terms that relate it to the computational requirements that it
places on the hardware. It is this mapping that is the topic of
discussion in the next section. Using information from the
hardware database discussed in Section 2, the performance of
candidate architectures can be evaluated by some measure such
as those in [28].

After the architectures of all the levels have been proposed,
the approximate performance of the system is known.
Simulation is required for an exact evaluation of the performance of the
system. This is required to ensure that the system will perform
as desired.

4. Nine Evaluation Categories: Their Relationship to Hardware and Software

When designing hardware for a specific algorithm,
characteristics of the algorithm must be “mapped” onto the hardware.
To build hardware to execute a given layer, a user must supply
each of the following evaluation parameters about each
layer in the system.

(1) Type, rate, and amount of input

(2) Type and number of operations per input datum

(3) Range and accuracy of arithmetic data to be used

(4) Algorithm to be used

(5) Type, frequency, and message length of processor-to-processor
communications

(6) Amount of memory required

(7) Type, amount, and benefit of parallelism

(8) Type, rate, and amount of output

(9) Evaluation criteria

These parameters form a model of the algorithms in the task.
The information they supply can be used with the hardware
model of Section 2 and the design scenario of Section 3 to
develop a macro-pipelined architecture for the task.

Category (1) places restrictions on the input buffering,
input data rate, and the internal data format of a level. The
type of data specifies the format and word width required to
process the incoming data. Combined with the rate, the type of
data specifies the speed of the input unit. Between levels, either
double-buffering or triple-buffering [5] may be used, i.e., two or
three memory units may be employed between adjacent levels
to allow the overlap of computation and I/O. If the application
does not require real-time processing, then the system must be
such that the incoming data rate of the first level determines the
steady state throughput. For real-time applications, the
incoming data rate of the first level determines the minimum data rate
for the system. The difference is: a non-real-time system can
stop the incoming data stream as necessary; however, a
real-time system may not be able to stop the incoming stream of
data without losing data.

Evaluation category (2) determines the specific number of
operations that must be performed by a given level in time T_i.
From one data set to another the required processing may vary,
so an exact statement of what operations must be performed
may be unavailable; however, a reasonable estimate may be
calculated for either the average case or the worst case through
either simulation techniques or statistical analysis. An
average value for the number of calculations or a worst case
value may be used.

The number of each operation can be multiplied by the
corresponding execution time for a single instruction (from the
first N-tuple). If the resulting values are summed and multiplied
by the clock speed of a processor, the worst case execution time
can be determined for that processor. To yield a more precise
bound on the execution time of a process with processors that
contain pipelines or parallel units, either simulation or task
analysis such as that shown in [10] must be applied to the
algorithm.
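
The worst case estimate described here reduces to a weighted sum. The sketch below assumes that the per-instruction times are stored as cycle counts and that the processor's clock entry is a cycle period (as in the example 6-tuple); the dictionary keys are hypothetical instruction names.

```python
def worst_case_time(op_counts, cycles_per_op, clock_period):
    """Sum of (count x cycles) over all operations, scaled by the clock period.

    op_counts:     {operation: number of times the layer performs it}
    cycles_per_op: {operation: cycles for one execution on this processor}
    clock_period:  duration of one clock cycle
    """
    missing = set(op_counts) - set(cycles_per_op)
    if missing:
        # the processor cannot execute these directly; they would have to be
        # synthesized from other instructions before a bound can be given
        raise KeyError(f"no timing information for: {sorted(missing)}")
    total_cycles = sum(n * cycles_per_op[op] for op, n in op_counts.items())
    return total_cycles * clock_period
```

Repeating the call for every processor in the database gives one serial upper bound per processor, which is the figure that pipelining and overlap analysis later refine.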

Classes of algorithms of concern for this parameter are
those algorithms that perform the same operations on each data
element (data independent) and those algorithms that treat
each data element differently (data dependent). For data
independent algorithms, the number and type of each operation
performed are countable from the algorithm. For a data
dependent algorithm, the number of operations can be determined
through simulation on sample data sets or, in some cases,
through analysis making certain assumptions about the
characteristics of the data. Typically, data dependent algorithms
require varying resources and processing times.

The Data Dependency (DD) of an algorithm is:

DD = (Data-Dependent Operations) / (Total Operations)

and can be used as an indicator of what percentage of the
expected execution time is fixed (i.e., data independent) and
what percentage may vary (i.e., data dependent). It also
indicates the appropriateness of SIMD or MIMD parallelism.
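
As a small worked illustration of the ratio, the sketch below computes DD and splits an expected execution time into a fixed and a variable share. Treating the time split as proportional to the operation counts is an assumption made here; it holds only if all operations cost roughly the same.

```python
def data_dependency(dependent_ops, total_ops):
    """DD = data-dependent operations / total operations."""
    if total_ops <= 0 or not 0 <= dependent_ops <= total_ops:
        raise ValueError("need 0 <= dependent_ops <= total_ops and total_ops > 0")
    return dependent_ops / total_ops


def split_expected_time(expected_time, dd):
    """Return (fixed, variable) shares, assuming equal cost per operation."""
    variable = expected_time * dd
    return expected_time - variable, variable
```

A layer with 25 data-dependent operations out of 100 has DD = 0.25, so about three quarters of its expected time can be treated as fixed.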

Operations can be divided into five groups: (A) arithmetic
and logic, (B) addressing, (C) index calculation (loop variables),
(D) conditional, and (E) inter-processor data transfer. These
classes were chosen to yield partial information about which
operations can be overlapped. For example, on some SIMD
systems, operations in (C) and some in (D) can be done in the
control unit, overlapped with the parallel execution of the rest of
the operations done by the processing elements. Information
about class (E) indicates how much the network will be used.
On a system where all processing is done by the same unit, the
distinction between the types of operations is diminished;
however, to construct special purpose hardware for real-time
processing the distinction is useful.

Information about (A), (B), and (C) must be further
sub-divided to provide information necessary to choose suitable
processing hardware. For example, (A) and (C) should be
divided into: floating point additions, subtractions,
multiplications, divisions, comparisons, and special functions; and fixed
point additions, subtractions, multiplications, divisions,
comparisons, and special functions. (B) should be divided into load
and store.

The number of operations in each of the above sub-groups
gives the absolute number of each operation to be done. From
this, it is possible to calculate the relative importance of the
speed of each operation. For each floating point or fixed point
special function, the number of times each operation is expected
to be performed is specified along with an equivalence relation,
giving the number of “standard” [11] operations needed to
implement the specified function in software. If a unit cannot
perform a specified function directly in hardware, the time
required to synthesize that function (specified by the
equivalence relation) must be calculated. If a special device (e.g.,
coprocessor) is available to perform the special functions, the
need for including this device can be determined. By using this
approach, various units can be ranked by their execution speed
for a given algorithm.

The numerical range and accuracy (3) places various
limitations on the hardware. Typically, more accurate hardware
(larger words) will be slower and/or more costly than hardware
with smaller words. Thus, it would be advantageous to use the
smallest word size meeting the range and accuracy constraints.
Floating point operations are typically slower than the
corresponding integer operations. In certain cases, if the
numerical range required for various calculations is small, but out of
the range of specific hardware, e.g., underflow, normalization of
data can eliminate the need for special hardware at the cost of
some processing time. The arithmetic range associated with a
set of operations greatly affects the hardware required [30].
Approaches to dynamic word size machines, such as those in [1],
[16], and [18], can be employed in cases where arithmetic ranges
vary from loop to loop.

The numerical range and accuracy of a sub-task is a
function of algorithm and data. For an algorithm, it is necessary to
determine the maximum and minimum values of the range of
the calculations. The range of the calculations should be divided
according to the range of index values, range of integer
arithmetic, and the range of floating point arithmetic. This specifies,
in the SIMD case, the word size of the control unit, and the word
size of the integer and floating point units. In other cases the
word size of the integer unit is set according to the maximum
range required for integer and indexing arithmetic.
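
For the integer and index ranges, choosing the smallest adequate word size can be sketched as follows. The sketch assumes two's-complement integer hardware and a hypothetical menu of available widths; floating point range and accuracy would need a separate treatment of exponent and mantissa bits.

```python
def min_signed_bits(lo, hi):
    """Smallest two's-complement width holding every integer in [lo, hi]."""
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    bits = 1
    # a b-bit two's-complement word covers -2**(b-1) .. 2**(b-1) - 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


def choose_word_size(lo, hi, available=(8, 16, 32)):
    """Pick the narrowest available word size covering the range, or None."""
    needed = min_signed_bits(lo, hi)
    for width in sorted(available):
        if width >= needed:
            return width
    return None
```

An index range of 0 to 255 needs 9 signed bits, so among 8-, 16-, and 32-bit parts the 16-bit unit is the smallest candidate; a `None` result signals that normalization or a wider special-purpose unit has to be considered.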

With knowledge about the algorithm a level is to process
(4), SIMD and/or MIMD horizontal parallelism can be
introduced. Special parallel analysis techniques, such as those
discussed in [3], [10], and [29] can be employed to utilize “extra”
parallelism. This can be accomplished by breaking the algorithm
down into multiple streams, using MIMD parallelism.
Applicable loops are those containing variables that can be calculated
independently of other variables within the loop. The
“breakdown” occurs when a variable can be extracted from a loop and
calculated in a separate environment (either a different
processor or processors) [3]. Other techniques for parallel processing
such as the use of “recursive doubling” for calculating sums or
maximums [32] using SIMD or MIMD parallelism can be
applied.
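
The idea behind recursive doubling for a sum can be simulated serially. The sketch below is a pairwise, doubling-stride reduction written for illustration; it is not taken from [32], and the inner loop stands in for additions that a parallel machine would perform simultaneously.

```python
def recursive_doubling_sum(values):
    """Sum a sequence in about log2(n) combining steps.

    Returns (total, steps). Within one step every addition is independent,
    so P processing elements could perform them at the same time.
    """
    data = list(values)
    if not data:
        raise ValueError("need at least one value")
    n = len(data)
    stride, steps = 1, 0
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]  # combine with partner `stride` away
        stride *= 2                      # partner distance doubles each step
        steps += 1
    return data[0], steps
```

Eight values are summed in three steps instead of seven serial additions; replacing `+=` with a maximum gives the corresponding reduction for maximums.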

The algorithm is required to obtain timing information
from the previously discussed N-tuples describing the hardware
database. By multiplying the number of each type of operation
by the corresponding operation time, an upper bound on the
execution time can be obtained. The algorithm must be
scanned to determine what percentage of the operations can be
pipelined and/or overlapped. This must be done for each
processor in the database. After the amount of time saved by the
parallelism and pipelines is determined, this time is then
subtracted from the execution time for the processor. For systems
with reconfigurable pipelines, the reconfiguration time must be
multiplied by the number of reconfigurations required by the
algorithm, and the product added to the execution time.
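
The adjustment of the upper bound can be written as a one-line formula. Adding the total reconfiguration cost to the result is the interpretation assumed in this sketch; the parameter names are illustrative.

```python
def adjusted_execution_time(upper_bound, time_saved,
                            reconfig_time=0.0, reconfigurations=0):
    """Refine the serial upper bound for one processor.

    upper_bound:      sum of (operation count x operation time)
    time_saved:       time recovered through pipelining and overlap
    reconfig_time:    cost of one pipeline reconfiguration
    reconfigurations: number of reconfigurations the algorithm needs
    """
    if not 0 <= time_saved <= upper_bound:
        raise ValueError("time_saved must lie between 0 and upper_bound")
    return upper_bound - time_saved + reconfig_time * reconfigurations
```

For example, an upper bound of 100 with 30 saved by overlap and five reconfigurations costing 2 each yields an estimate of 80.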

By deriving bounds on execution time as described in [13],
levels requiring large amounts of time can be analyzed. This will
indicate where each level is spending its execution time. If
consistent variable names are used from layer to layer, similar task
decomposition to the above can be applied across levels to allow
the combination and/or sub-division of levels as needed.
Consider the scenario in Fig. 4.1. If level three calculates a, b, and c
independently of the output of level two, and the throughput of
level three is too low, the portion of the algorithm calculating
a, b, and c can be moved to level two. If this makes the throughput
of level two too low, a separate unit can be employed for the
calculations. The result is shown on the right of the figure.

The type, frequency, and message length of the
processor-to-processor communications within a layer (5) will dictate the
topology of a level and the design of the interconnection
network. There are two types of interconnection networks. A
global interconnection network allows a given processor to
communicate directly with any other given processor within a given
horizontally parallel structure (e.g., SIMD or MIMD portion of
machine). Typically a multistage arrangement is used for such a
network [20] (although it does not permit all possible SIMD data
permutations). The second type of interconnection network is a
local interconnection network, which allows a processor to
communicate with a specific number of its neighbors (e.g., 4- or
8-nearest neighbors) [30] and [33]. In this case, the processors can
be viewed as either a one, two, or three dimensional array when
determining the connections to be made by the network. A
network must be capable of making the desired connections
efficiently and with minimal collisions, to avoid significant
delays during transfers. It would be desirable to have a
database of known global connection networks and the permutations
that they can perform, so an appropriate connection network
can be chosen.

Fig. 4.1 Scenario before and after application of techniques in [13]

From the type of communications required by a layer,
information can be gained about the type of processing that
should take place on a given level, i.e., the more random the
communications, the more likely that a horizontally parallel
level should use MIMD (asynchronous) parallelism, as opposed
to SIMD (synchronous) parallelism. Knowing the length of the
transfers will aid the design of the network. For instance, the
longer the transfers, the more suitable a circuit switched
network becomes. For small transfers, a packet switched network
is desirable. The number of network transfers and the length of
the average transfer provide information about the loading of a
network with a given transfer speed.
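
A rough loading figure can be derived from these quantities. The sketch below assumes a single shared path with a fixed transfer speed and ignores setup time and contention; all parameter names and units are illustrative.

```python
def network_load(num_transfers, avg_length, transfer_speed, interval):
    """Fraction of `interval` during which the network is busy.

    num_transfers:  transfers issued during the interval
    avg_length:     average transfer length (e.g. words)
    transfer_speed: units of length moved per unit time
    interval:       length of the observation window (e.g. the time T_i)
    """
    if transfer_speed <= 0 or interval <= 0:
        raise ValueError("transfer_speed and interval must be positive")
    busy_time = num_transfers * avg_length / transfer_speed
    return busy_time / interval
```

A value near or above 1.0 suggests that the network, rather than the processors, would limit the level, so a faster or differently organized network should be examined.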

Determination of the type and amount of
processor-to-processor communication for a highly data independent task is
straightforward and can be obtained from analysis of the
parallel structure of the algorithm statement in (4). For data
dependent tasks, the required transfers may vary in length and
connection, dependent solely on the data set being processed.
Simulation may be required to achieve accurate estimates. To
minimize the need for simulation, analysis of the data set can
yield information about the required connections. For example,
if a process performs edge tracing on an image containing small
objects (relative to the image size), global connections are not
required; only local (nearest neighbor) connections are needed
[34]. If the objects are large, then global connections may be
needed.

Memory size (6) is an important factor in the design of a
system  and  is  a  function  of  the  proposed  data  set  size,  data  type, 
and  algorithm.  The  data  set  size,  processors  in  a  level,  and  algo¬ 
rithm  chosen  have  an  important  bearing  on  how  much  memory 
is  associated  with  a  processor  in  a  given  level.  This  will  be  con¬ 
sidered  in  addition  to  the  buffer  memory  associated  with  a  given 
level. 

Memory usage falls into three classes: program, stack, and
data memory. Program memory is not determinable from the
algorithm, although an estimation is possible. It is a function of
the machine and the compiler. The stack memory contains
arguments to subroutines, return addresses, and temporary
information. It is a function of the nesting of subroutines, along
with the information that is passed to the subroutines. For data
dependent recursive algorithms, simulation may be required to
determine the appropriate amount of stack memory needed. An
alternative to simulation is to place a maximum depth (in terms
of calls to specific functions) on the stack. If each specific
function is called with given arguments (each with a given size),
calculation of the stack size is straightforward, based on the
maximum number of calls times the space needed for each call.
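
The stack estimate amounts to a sum of products. In the sketch below, the per-call space is assumed to already include arguments, return address, and temporaries, and the function names in the example are hypothetical.

```python
def stack_size(max_calls, space_per_call):
    """Worst-case stack requirement.

    max_calls:      {function: maximum number of simultaneously active calls}
    space_per_call: {function: space needed for one call}
    """
    missing = set(max_calls) - set(space_per_call)
    if missing:
        raise KeyError(f"no frame size for: {sorted(missing)}")
    return sum(n * space_per_call[f] for f, n in max_calls.items())
```

With at most 10 nested calls of a 24-byte `trace` routine and 2 of a 40-byte `merge` routine, the bound is 320 bytes.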

The data memory size is composed of the index memory
size and the process data memory size, where the index memory
is the memory required to store loop counters and some index
variables. The process data memory is the memory used to
store the parameters, working data, intermediate results, and
index variables that could not be stored in the CU of an SIMD
system. For a data independent task, data set size is trivial to
determine from an algorithm. A data dependent task may
require simulation.

The particular divisions of memory stem from where the
data must be accessed. In an SIMD environment, the stack,
index memory, and program memory must be associated with
the control unit, while the process data must be accessible by
the processing elements. In other environments, this memory is
associated with the processor, so the divisions do not matter so
much as their total.

The type and amount of parallelism (7) will specify the
nature and maximum number of processors associated with a
given level. The benefit due to parallelism is specified in two
areas: (I) speedup due to P processors and (II) the maximum
value of P.

The type of parallelism is a function of the algorithm. Certain
algorithms may be written for an SIMD machine, thus
SIMD parallelism should be used. For a general algorithm,
determining whether an algorithm is best suited for a specific
environment can be done by looking at the DD, as discussed
above. For a typical parallel algorithm, the lower the DD, the
more likely an algorithm is suited to SIMD type processing.
Typically, MIMD parallelism is more flexible; however, SIMD
parallelism has the advantages of built-in synchronization and
the ability to overlap CU control operations with processing
element instruction execution.

The amount of parallelism can be determined by several
criteria. Typically, the larger the number of processors, the less
processing each processor must perform and the more significant
transfer and wait times become. As transfer and wait times
become more significant, the processors will spend a larger portion
of time idle, so the utilization of a processor will decrease.
A variety of performance measures are discussed in [78]. These
can be used to determine the relative benefit of each additional
processor, allowing one to calculate the number of processors
associated with a given level.

The speedup due to P processors (I) can be obtained by
analyzing the algorithm. This figure can be used to determine
the decrease in response time by using P processors. The maximum
value of P (II) is the ceiling on the amount of parallelism.
This represents the maximum amount of inherent parallelism in
a given task and can be calculated by analyzing the task. For
both I and II, simulation may be required for data dependent tasks.

Knowledge of the type, rate, and amount of output (8) will
be required for any formatting that must be done to interface
the data to the device gathering the results. In addition, it
places constraints on the output data rate.

Finally, the evaluation criteria (9) define how the merit of a
system is to be calculated. Here, the evaluation criteria will be
speed and cost, i.e., the execution must occur in real-time for the
minimum cost. For non-real-time systems, other criteria such as
those considered in [8] and [28] may be used, e.g., efficiency,
utilization, and power consumption. By incorporating the evaluation
criteria into the design procedure, proposed designs not
meeting the evaluation criteria can be avoided. In addition, this
provides a way to rank various designs.

5. Example of Approach

Consider the application of the nine parameters to a task
such as Dynamic Time Warping (DTW), which is performed in
speech processing. This algorithm warps incoming utterances to
find the best match in a list of templates of known words. It
represents the most computationally intensive portion of the
proposed speech processing scenario and corresponds to one
layer of the task (Fig. 1.1).


hold = ∞; template = 0;              /* initialization */
for k = 1 to 10000 {                 /* for each template */
    for j = -1 to 1 {                /* initialization */
        for i = -1 to 1 {
            g[i][j] = 0;
        } /* end i */
    } /* end j */
    for j = 1 to 80 {                /* for each frame in a and b[k] */
        for i = j-r to j+r {         /* each frame within window */
            if (i<0) i = 1;          /* force i to be valid */
            if (i>80) i = j-r+1;
            else {
                d[i][j] = 0;
                for h = 1 to 9 {
                    /* compute 'distance' between
                       frames a[i] and b[k][j] */
                    d[i][j] = d[i][j] +
                        (a[i][h] - b[k][j][h])**2;
                } /* end h */
                g[i][j] = min(g[i-1][j-1] + 2d[i][j],
                              g[i-1][j-2] + 2d[i][j-1] + d[i][j],
                              g[i-2][j-1] + 2d[i-1][j] + d[i][j]);
            } /* end else */
        } /* end i */
    } /* end j */
    D(a,b[k]) = g[80][80];
    if (D(a,b[k]) < hold) {          /* store minimum value */
        hold = D(a,b[k]);
        template = k;
    } /* end if */
} /* end k */

a - unknown word (UW)
a[i] - frame i of UW
a[i][h] - element h of vector describing frame i of UW
b[k] - reference word k (RWK)
b[k][i] - frame i of RWK
b[k][i][h] - element h of vector describing frame i of RWK
D(a,b[k]) - distance between UW and RWK
g[i][j] - cumulative distance between a and b[k]
hold - distance number of best fitting reference word
template - number of best fitting reference word

Fig.  5.1.  Sample  DTW  algorithm 
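
As an illustration of the three-term minimum in Fig. 5.1, the following is a minimal runnable sketch in Python. It is not the program used in the paper. It assumes (these are choices made here) that cells outside the window are treated as infinitely distant, that the cumulative distance starts from an origin cell of zero just before frame (1, 1), and that out-of-range frames are skipped instead of having the index forced to a valid value as the figure does. The names `frame_distance`, `dtw_distance`, and `best_template` are hypothetical.

```python
import math


def frame_distance(x, y):
    """Sum of squared differences between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(x, y))


def dtw_distance(a, b, r):
    """Cumulative distance between equal-length frame lists a and b,
    restricted to the window |i - j| <= r."""
    n = len(a)
    inf = math.inf
    # Frame i (1-based) is stored at index i + 1, so the references
    # to i-2 and j-2 in the recurrence stay inside the arrays.
    g = [[inf] * (n + 2) for _ in range(n + 2)]
    d = [[inf] * (n + 2) for _ in range(n + 2)]
    g[1][1] = 0.0  # assumed origin cell (0, 0)
    for j in range(1, n + 1):
        for i in range(max(1, j - r), min(n, j + r) + 1):
            d[i + 1][j + 1] = frame_distance(a[i - 1], b[j - 1])
    for j in range(1, n + 1):
        for i in range(max(1, j - r), min(n, j + r) + 1):
            ii, jj = i + 1, j + 1
            g[ii][jj] = min(
                g[ii - 1][jj - 1] + 2 * d[ii][jj],
                g[ii - 1][jj - 2] + 2 * d[ii][jj - 1] + d[ii][jj],
                g[ii - 2][jj - 1] + 2 * d[ii - 1][jj] + d[ii][jj],
            )
    return g[n + 1][n + 1]


def best_template(a, templates, r=3):
    """Return (1-based template number, distance) of the best match."""
    hold, template = math.inf, 0
    for k, b in enumerate(templates, start=1):
        dist = dtw_distance(a, b, r)
        if dist < hold:  # store minimum value
            hold, template = dist, k
    return template, hold


utterance = [[0], [1], [2]]
templates = [[[5], [5], [5]], [[0], [1], [2]]]
print(best_template(utterance, templates))
```

The outer loop over templates is the part that can be divided among processors, since each template is compared with the utterance independently.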

A DTW algorithm is shown in Fig. 5.1 [37]. The input to
this algorithm consists of 80 frames of speech, each represented
by a vector of nine 16-bit integers (80). There will be one 16-bit
quantity used to identify each word. Assume there are 10,000
templates in the database (meaning the system can understand
1000 words since ten templates are required for each word [25]).
The variable "r" is the amount the algorithm will be allowed to
warp the incoming template. For the purposes of this paper,
r=3. The nine evaluation categories are as follows:

I. Type, rate, and amount of input
Type: Fixed point data
Rate: 1 utterance/1.0 second
Amount: 720 fixed point numbers/utterance

II. Type and number of operations/input datum
0.8M index variable assignments
0.1M index variable additions
00.1M index variable additions (+1)
07.3M index variable conditional branches
132.7M address calculations
105.5M fixed point additions
5.8M fixed point assignments
11.3M fixed point conditional branches
00.7M fixed point multiplications
00.7M fixed point subtractions


III. Range and accuracy of arithmetic data
±2^15, ±1

IV. Type, amount, and frequency of processor-to-processor
communication
Type: Global, capable of recursive doubling [32]
Amount: 3 fixed point numbers
Freq.: 2 log2 P transfers per second

V. Amount of memory required
Memory: (14.5/P)+0.01 Mbytes of data per processor for
reference (template) and incoming utterance storage
10 Kbytes of program and stack
Note: one copy of the program is required per
processor for MIMD machine; one copy in the control
unit for SIMD machine.

VI. Type, amount, and benefit of parallelism
Type: SIMD or MIMD
Amount (max): 10,000 (utterances in database)
Benefit:

$$\text{speedup} = \frac{T}{T/P + \lceil \log_2 P \rceil \,(IC + 2 \times NO)}$$

where a single processor takes time T, IC is the time
for an integer comparison, and NO is the time for a
network operation.

VII. Algorithm to be used
Algorithm: See Fig. 5.1.

VIII. Type, rate, and amount of output
Type: 1 English word
Rate: 1 per second
Amount: 100 characters maximum (arbitrary)

IX. Evaluation criteria
Speed and cost


These nine evaluation categories represent an analysis of
the algorithm. Evaluation category II is directly determinable
from the algorithm. The range and accuracy is determinable
from the application. [24] states that 2^15, ±1 is a reasonable
range and accuracy for this task. To apply a parallel machine to
this algorithm, each processing element would need to execute
this algorithm on its own portion of the template database,
computing a local D(a,b[k]). Recursive doubling [37] would then be
used to combine the results; i.e., the word associated with the
smallest D(a,b[k]) is the chosen word. This requires 2 log2 P
transfers for the D(a,b[k])'s and the identifiers for their
associated words.
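
The combining step can be sketched as a simulation of recursive doubling over P processors. This is an illustrative sketch under assumptions made here: P is a power of two, each processor holds one (distance, word identifier) pair, and in each step a processor exchanges its pair with the partner whose index differs in one bit. The function name `recursive_doubling_min` is hypothetical.

```python
def recursive_doubling_min(local):
    """Simulate recursive doubling on a list of (distance, word_id)
    pairs, one per processor. Returns the per-processor results and
    the number of exchange steps."""
    p = len(local)
    assert p > 0 and p & (p - 1) == 0, "P must be a power of two"
    vals = list(local)
    steps = 0
    stride = 1
    while stride < p:
        # Each processor i receives the pair held by processor
        # i XOR stride and keeps the smaller of the two.
        vals = [min(vals[i], vals[i ^ stride]) for i in range(p)]
        stride *= 2
        steps += 1
    return vals, steps


local_results = [(7.0, "g"), (3.0, "c"), (9.0, "i"), (5.0, "e"),
                 (8.0, "h"), (4.0, "d"), (6.0, "f"), (2.0, "b")]
final, steps = recursive_doubling_min(local_results)
# Two values (distance and identifier) move per step.
transfers = 2 * steps
print(final[0], steps, transfers)
```

After log2 P steps every processor holds the global minimum, and since each step moves a distance and an identifier, the count of 2 log2 P transfers follows.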

The amount of memory is expressed as a function of P, the
number of processors. A "C" language program was coded and
compiled to estimate the program size. The DD is small, so
either SIMD or MIMD parallelism can be applied to the program;
however, the maximum parallelism is 10,000 processors,
assuming each PE executes the algorithm for one or more templates.
Application of P processors will yield the speedup shown
in VI. The output of this system is one word. It is imperative
that the system keep up with the input; however, it is desirable
to do so with a minimal cost.
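
The dependence of speedup and memory on P can be explored numerically. The sketch below assumes the speedup takes the form T / (T/P + ceil(log2 P) (IC + 2 NO)) and that template storage is divided evenly among processors; the template size is derived from the stated 10,000 templates of 80 frames with nine 16-bit values plus one 16-bit identifier. The timing values in the example are illustrative placeholders, not measurements from the paper.

```python
import math


def speedup(T, P, IC, NO):
    """Speedup of P processors over one, with a recursive doubling
    combine step of ceil(log2 P) stages."""
    return T / (T / P + math.ceil(math.log2(P)) * (IC + 2 * NO))


def template_mbytes(templates=10_000, frames=80, values=9, bytes_per_value=2):
    """Total template storage in Mbytes, including one identifier
    of bytes_per_value bytes per template."""
    return templates * (frames * values + 1) * bytes_per_value / 1e6


def data_mbytes_per_processor(P, fixed_mbytes=0.01):
    """Per-processor data memory: an even share of the templates
    plus a fixed part (assumed 0.01 Mbytes)."""
    return template_mbytes() / P + fixed_mbytes


# Illustrative values only: T in seconds, IC and NO in seconds.
for P in (1, 10, 100, 1000, 10_000):
    print(P, round(speedup(1000.0, P, 1e-6, 5e-4), 1),
          round(data_mbytes_per_processor(P), 4))
```

With these placeholder values the speedup stays close to P until the combine term becomes comparable to T/P, which is the effect the benefit expression is meant to capture.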

The number of each calculation can be multiplied by the
single-operand execution times of the tuples for each processor
in the database. The sum of the products yields an approximate
worst-case execution time for a single copy of each processor in
the database to perform this algorithm. Actual execution time
could be better due to clever software or special hardware
functions. For example, software can be written to avoid redundant
calculations (e.g., calculating the address of b[j][k] only
once in the expression b[j][k] = b[j][k] + 5). Also, by applying
pipeline analysis techniques to this algorithm and using
structural information about each processor, such as functional
overlap, stages in processing pipelines, and the multiplicity of
units, a more precise approximation of the single processor
execution times can be obtained.
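
The sum of products described above can be sketched as follows. The operation names, counts, and cycle times in the example are hypothetical placeholders chosen for easy arithmetic; a real analysis would use the operation counts of category II and the timing tuple of the processor under consideration.

```python
def worst_case_seconds(op_counts, op_cycles, clock_hz):
    """Approximate worst-case execution time: the sum over operation
    types of (number of operations) x (cycles per operation),
    divided by the clock rate."""
    total_cycles = sum(count * op_cycles[op]
                       for op, count in op_counts.items())
    return total_cycles / clock_hz


# Hypothetical operation mix and per-operation cycle counts.
counts = {"add": 1_000_000, "mul": 500_000}
cycles = {"add": 4, "mul": 70}
estimate = worst_case_seconds(counts, cycles, clock_hz=12.5e6)
print(estimate)
```

Because every operation is charged its full single-operand time, overlap and address reuse can only lower the real figure, which is why the result is treated as a worst case.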

Based on the desired throughput and response time, additional
processors of the same type are repetitively added until a
level composed of such processors could meet the time
requirements. The number of processors is then multiplied by the cost
of the associated hardware. To this amount, the price of other
devices, such as memory and inter-processor communications
links, is added to approximate the cost of the processing
hardware involved. The processor used for the design
will be chosen based on the least expensive hardware.
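
The selection procedure can be sketched as a loop that adds processors until the deadline is met and then compares total cost across processor types. This is a minimal sketch: the candidate records, their field names, the per-node lumping of memory and link prices, and all numeric values are assumptions made for illustration.

```python
import math


def parallel_time(T, P, IC, NO):
    """Time for P processors, including the combine step."""
    return T / P + math.ceil(math.log2(P)) * (IC + 2 * NO)


def processors_needed(T, deadline, IC, NO, max_p):
    """Add processors one at a time until the deadline is met.
    Returns None if max_p processors are still too slow."""
    for P in range(1, max_p + 1):
        if parallel_time(T, P, IC, NO) <= deadline:
            return P
    return None


def cheapest_design(candidates, deadline, max_p):
    """Return (name, processor count, cost) of the least expensive
    candidate that meets the deadline, or None."""
    best = None
    for c in candidates:
        P = processors_needed(c["T"], deadline, c["IC"], c["NO"], max_p)
        if P is None:
            continue
        # Memory and link prices are lumped into a per-node figure.
        cost = P * (c["processor_cost"] + c["per_node_extras"])
        if best is None or cost < best[2]:
            best = (c["name"], P, cost)
    return best


# Hypothetical processor library entries.
library = [
    {"name": "A", "T": 100.0, "IC": 0.0, "NO": 0.001,
     "processor_cost": 10, "per_node_extras": 5},
    {"name": "B", "T": 20.0, "IC": 0.0, "NO": 0.001,
     "processor_cost": 100, "per_node_extras": 5},
]
print(cheapest_design(library, deadline=1.0, max_p=10_000))
```

In this made-up library the slower but cheaper processor wins, because the larger processor count it needs still costs less in total.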

Consider the application of a Motorola 68000 [20] to the
above task. The tuples enumerating the operations and their
respective times contain over 1000 instructions; a partial list is
included for brevity:

(add r,#; add r1,r2; add (a)+,r; cond. branch; mov r,#; mov
r,(a); mov #,(a); mul r1,r2; mul (a)+,r; sub r,#; sub r1,r2; sub
(a)+,r)

where r stands for register, # stands for immediate, (a) stands
for memory location stored in register "a", (a)+ stands for
memory location stored in register "a" followed by incrementing
"a".

The  tuple  describing  the  timings  (in  cycles)  is: 

(8, 4, 8, 10 (true)/8 (false), 8, 12, 12, 70, 74, 8, 4, 8)

The 68000 has no functional overlap or pipelining other than a
five stage instruction decoder. These tuples will be omitted. A
68000 has no special address calculation hardware, so a
two-dimensional address calculation requires loading a register,
multiplying by a memory location, and the addition of two memory
locations. Assuming that the index variables are stored in registers
and that fixed point numbers are stored in memory, a 12.5
MHz 68000 would take 1570 seconds to perform dynamic time
warping on a single word. Using a multistage cube network that
takes 1.0 msec for two transfers, 1000 processors in MIMD mode
would take .098 seconds to perform dynamic time warping. (A
thorough analysis should consider the overlap of CU and PE
operations in SIMD mode, e.g., address calculations.) Dynamic
time warping is normally done with fewer than 100 reference
templates because of its great computational complexity.

Such an analysis would be required for each processor in
the database. Then, an actual implementation of the above
approach would consider simulating the algorithm on the various
processors to obtain a more accurate timing estimation.
Finally, if no processor in the database could be used to implement
this algorithm, the layer would need to be broken down
into sublayers, each of which would be analyzed with the
proposed techniques.

6. Conclusions

Using the above nine categories, an algorithm can be
analyzed according to what requirements it places on a system.
If many hardware components are analyzed and categorized
according to abilities and processing times, a library containing
information about these processors can be built. By using these
models of algorithms and hardware to map the organization of
each level in a multi-level design to a layer of software, computers
can be used to aid in the design of systems for the specific
needs of algorithms, thus making possible computer assisted
design of special purpose parallel architectures.

In summary, this was a study of one approach to model the
design of a class of macro-pipelined parallel architectures.
Categories of hardware analysis were presented. Their relationship
to the hardware requirements and their dependence on the
algorithm to be performed were discussed. An example of the
application of the parameters and tuples was shown. By studying
approaches to bridging the gap between hardware and algorithms,
computer aided special purpose machine design comes
closer to being a reality.

Acknowledgements: The authors of this paper wish to
extend their deepest thanks to M. Yoder, without whom the
discussion on the isolated word recognition system could not have
been done in real-time, and to M. Franklin, G.J. Lipovski, P.
Swain, and A. van Tilborg, whose careful readings and comments
helped to organize and clarify the ideas in this paper.
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